Methods and systems for searching genomic databases

ABSTRACT

The instant invention provides methods and systems for searching genomic databases using polypeptide sequence information, such as those obtained from peptide sequencing projects, especially those using mass spectrometers. According to the instant invention, polypeptide sequences can be reverse translated into multiple sequence tags which are then used to search for identical or similar sequences in genomic databases, such as unannotated genomic databases of human or other organisms. Alternatively, the polypeptide sequences can be directly compared to sequences translated from at least 3, preferably all 6 reading frames of genomic sequences. The instant invention also provides systems for performing the methods of the instant invention, including computer systems, and systems including said computer systems and mass spectrometers linked to said computer systems. The instant invention further provides methods of conducting proteomic businesses using the methods of the instant invention.

REFERENCES TO RELATED APPLICATIONS

[0001] This application claims priority to U.S. Provisional Application No. 60/282,551, filed on Apr. 9, 2001, and U.S. Provisional Application No. 60/285,362, filed on Apr. 20, 2001, the specifications of which are incorporated herein by reference in their entireties.

BACKGROUND OF THE INVENTION

[0002] Systematic analysis of the function of genes can take place at the oligonucleotide or protein level. The latter has the advantage of being closest to function, since it is proteins that perform most of the reactions necessary for the cell. For most protein based (“proteomic”) approaches to gene function, mass spectrometry is the method of choice. Mass spectrometry can now identify proteins with very high sensitivity and medium to high throughput. New instrumentation for the analysis of the proteome has been developed including a MALDI hybrid quadrupole time of flight instrument which combines advantages of the mass fingerprinting and peptide sequencing methods for protein identification. New approaches include the isotopic labeling of proteins to obtain accurate quantitative data by mass spectrometry, methods to analyze peptides derived from crude protein mixtures and approaches to analyze large numbers of intact proteins by mass spectrometry directly.

[0003] To date, only protein sequence databases (usually auto-translated from DNA sequence data) and expressed sequence tag databases have been searched by mass spectrometric data. It has been shown that virtually all human proteins which can be visualized on gels can be identified in the expressed sequence tag databases, which now contain more than two million single read cDNA sequences. Neubauer et al. (1998) Nat. Genet. 20:46. However, ESTs usually cover only part of a protein sequence. If it were possible to work directly with genomic sequence databases, this would in principle allow for the identification of every peptide on which mass spectrometric sequence information was obtained. Difficulties in genome searching include the large size of the human genome and the fact that only a few percent of it is protein coding. Additionally, the exon-intron structure of most genes cannot be accurately predicted by bioinformatics. Krogh (1998) in: Guide to Human Genome Computing (Bishop, M. J., Ed.), pp. 261-274, Academic Press, San Diego; and Dunham et al. (1999) Nature 402:489.

SUMMARY OF THE INVENTION

[0004] The present invention provides a method for identifying a coding sequence in a genomic database, e.g., an unannotated genomic database, comprising:

[0005] (i) generating, for an input polypeptide sequence, a set of sequence tags corresponding to possible coding sequences for the input polypeptide sequence; and

[0006] (ii) identifying, by an approximate string matching method using said sequence tags, genomic sequences from a genomic database which are similar to one or more of the sequence tags.

[0007] In certain preferred embodiments, the genomic database is an unannotated genomic database. In another preferred embodiment, the method also involves determining an open reading frame for the input polypeptide sequence in the genomic database, and, optionally, determining intron/exon boundaries in the open reading frame. In that manner, the subject method can be used to update, or provide a cross-referenced database, including coding sequence and intronic annotation for the genomic database.

[0008] In certain embodiments, the input polypeptide sequence is provided from a system for protein sequencing by mass spectrometry. For instance, the subject method can be performed by a computer which has a data link from a mass spectrometer system for transmitting the input polypeptide sequence.

[0009] In certain embodiments, the approximate string matching method is selected from the group consisting of a Shift-And method, a Karp-Rabin fingerprint method, and a Commentz-Walter method. In certain preferred embodiments, the approximate string matching method is a GREP method, such as an AGREP method. In general, the approximate string matching method will be one which tolerates a maximal number of errors, such as gaps for intronic sequence, of a size equal to at least the average length of intronic sequences in the genomic database.

[0010] In certain embodiments, the approximate string matching method has an error ratio, α, that is less than 3.0, and even more preferably less than 1.0.

[0011] In certain embodiments, the subject method is carried out with multiple sequence tags, e.g., the multiple sequence tags are combined into a single array which is used as the input for the approximate string matching method.

[0012] Another aspect of the present invention provides a method for identifying a coding sequence in an unannotated genomic database, comprising:

[0013] (i) receiving an input polypeptide sequence; and

[0014] (ii) identifying, by an approximate string matching method using said input polypeptide sequence, coding sequences from a genomic database which has been dynamically translated in at least 3 reading frames.

[0015] Still another aspect of the present invention provides a computer system for identifying coding sequences in genomic databases, comprising:

[0016] (i) a sub-system for calculating and/or storing potential coding sequences for a polypeptide;

[0017] (ii) one or more databases of genomic sequence; and

[0018] (iii) an ID program for performing approximate string matching between nucleic acid sequences in a manner which accounts for differences between the two sequences due to an intronic sequence;

[0019] wherein the system generates a set of sequence tags corresponding to possible coding sequences for an input polypeptide sequence, and identifies, from the database, any genomic sequences which are similar to one or more of the sequence tags, and indicates exon/intron boundaries, if any, in the genomic sequence(s).

[0020] For instance, in certain embodiments, the computer system also includes a sample/identification proteomics database for logging and correlating information such as sample identity, gel photos, mass spectra (and features therein), and search results.

[0021] The subject system can also include a transfer system to automate the transfer and utilization of mass spectrometric data of a target polypeptide.

[0022] Still another aspect of the present invention provides a mass spectrometry system including the above computer system and a mass spectrometer for sequencing polypeptides. For example, the spectrometer may include an ion source selected from the group consisting of electrospray and MALDI.

[0023] Yet another aspect of the present invention relates to a method of conducting a proteomics business, comprising:

[0024] (i) by the above-described method, determining the identity of a target gene encoding a protein isolated on the basis of the protein being (a) involved in an interaction of interest, (b) having a cellular localization of interest, (c) having a differential expression pattern of interest, or (d) being post-translationally modified;

[0025] (ii) identifying agents by their ability to alter the level of expression of the target gene or the activity of an expression product of the target gene;

[0026] (iii) conducting therapeutic profiling of agents identified in step (ii), or further analogs thereof, for efficacy and toxicity in animals; and

[0027] (iv) formulating a pharmaceutical preparation including one or more agents identified in step (iii) as having an acceptable therapeutic profile.

[0028] The subject business method can include the additional step of establishing a distribution system for distributing the pharmaceutical preparation for sale, and may optionally include establishing a sales group for marketing the pharmaceutical preparation.

[0029] Still another aspect of the present invention provides a method of conducting a proteomics business, comprising:

[0030] (i) by the above-described method, determining the identity of a target gene encoding a protein isolated on the basis of the protein being (a) involved in an interaction of interest, (b) having a cellular localization of interest, (c) having a differential expression pattern of interest, or (d) being post-translationally modified;

[0031] (ii) (optionally) conducting therapeutic profiling of the target gene for efficacy and toxicity in animals; and

[0032] (iii) licensing, to a third party, the rights for further drug development of inhibitors or activators of the target gene.

BRIEF DESCRIPTION OF THE DRAWINGS

[0033] FIG. 1: Scheme representing the steps which can be used to identify a gene locus from MS sequencing of a protein.

[0034] FIG. 2: Illustration of how MALDI sequence data can be used to extend exon coverage (SEQ ID NOS: 91-100).

[0035] FIG. 3: Comparison of performance of various sequence analysis algorithms with respect to predicting gene structure (SEQ ID NOS: 101-102).

[0036] FIG. 4: Two sequences retrieved from the human genome by the indicated peptide sequence tag. Correlation of calculated Y-ion series of the two sequences with the tandem MS spectrum reveals that only one sequence can be correct (SEQ ID NOS: 103-104).

[0037] FIG. 5: Demonstrates the use of MS/MS and genome identification to elucidate the gene structure of a novel human protein (SEQ ID NOS: 105-112).

[0038] FIG. 6: Schematic representation of one preferred embodiment for information flow.

[0039] FIG. 7: Proposed information flow. All relevant information is stored in ProteomeDB, and unique Sample ID numbers are given. Links may also go directly to ProteomeDB from ProLogDB, ProAutoDB and the Prospects agents (not shown) in order to enter information without operator assistance.

[0040] FIG. 8: Main switchboard.

[0041] FIG. 9: Tables in ProteomeDB.

[0042] FIG. 10: Forms in ProteomeDB. In general, these parameters should not be modified except by an administrator familiar with the database program, such as MS Access.

[0043] FIG. 11: Reporting options in ProteomeDB. Reports can be transferred to a word processor (MS Word) by one button click and subsequently saved as a separate file (e.g. in rich text format) for easy distribution of analytical results via electronic means such as e-mail.

[0044] FIG. 12: ProteomeDB interface form.

[0045] FIG. 13: Search parameter window for peptide map queries.

[0046] FIG. 14: Search parameter window for peptide sequence tag queries.

[0047] FIG. 15: Search parameter window for breakpoint queries.

[0048] FIG. 16: Search parameter window for amino acid sequence queries.

[0049] FIG. 17: Sample information dialogue box.

[0050] FIG. 18: Search parameter window for automating ID program via ProAutoDB.

[0051] FIG. 19: Search parameter window for logging searches.

[0052] FIG. 20: Multi template interface.

[0053] FIG. 21: Search results window.

[0054] FIG. 22: 2nd pass check windows.

[0055] FIG. 23: Database entry window.

[0056] FIG. 24: Result summary window.

[0057] FIG. 25: ProLogDB browser window.

[0058] FIG. 26: Conversion of DNA sequence to amino acid sequence.

[0059] FIG. 27: Search parameter window for calculating theoretical fragment masses from a peptide sequence.

[0060] FIG. 28: Search parameter window for calculating theoretical peptide masses.

DETAILED DESCRIPTION OF THE INVENTION

[0061] I. Overview

[0062] Large-scale DNA sequencing efforts are yielding the DNA blueprint of the human genome as well as of other organisms, and attention is now shifting to the systematic functional analysis of the biological information encoded by the genomes. One aspect of these proteomics efforts utilizes mass spectrometry (MS) based protein identification, and relies on directly obtaining the sequence for a sample protein. While EST and other coding sequence libraries have been utilized to identify proteins from partial protein sequences, until the present invention it had been unclear whether genomic sequence information itself would be useful in the same way because of the vast size of genomes of higher organisms, the complex exon/intron structure of genes, and the large percentage of non-coding sequence. These features have made it very difficult to predict coding regions with certainty.

[0063] One aspect of the present invention is related to the demonstration that proteins can be directly sequenced using, e.g., mass spectrometry, and the coding sequences (along with information about intronic structures) for the proteins unambiguously identified in unannotated genomes as large as the human genome. A salient feature of the invention, as is described herein for embodiments utilizing mass spectrometry sequencing (“MS sequencing”), is that the subject method can be carried out using small amounts of proteins, e.g., sub-nanomol amounts of a test protein, and more preferably sub-picomol amounts of the test protein. In particular, the present invention is based on the discovery that a suitably modified pattern matching algorithm can be used with direct protein sequencing data to locate coding sequences in raw genomic data. In this regard, the mass spectrometric data can be used to predict the gene structure, such as intron/exon boundaries.

[0064] In one embodiment, the subject method is carried out as follows. Beginning with amino acid sequence data for a sample protein, such as may be provided using mass spectrometry, a set of degenerate nucleotide sequences (“reverse transcribed sequences” or “sequence tags”) is calculated for the input polypeptide sequence. The set of sequence tags represents all, or at least the most likely based on codon usage, nucleotide sequences which could encode the sample protein. Utilizing each of the sequence tags, one or more similarity searches of a genomic database(s) is carried out in “forward” and “reverse” directions to identify similar sequence(s) in the genomic database. In certain instances, the subject method will utilize a pattern matching algorithm in the search which accounts for gaps in the similarity between the sequence tag and the genomic sequence, e.g., which accounts for and identifies the occurrence of intronic sequences which may disrupt the genomic coding sequence for the sample protein. This may be carried out utilizing further sequencing data, or by calculating intron/exon boundaries using known rules for intron splicing, and, for example, knowledge of the molecular weight of an unmodified form of the sample protein.
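To make the sequence tag calculation concrete, the following is a minimal Python sketch that enumerates every nucleotide sequence able to encode a short peptide. It assumes the standard genetic code and does not weight by codon usage; the function names `sequence_tags` and `reverse_complement` are illustrative and do not appear in the specification.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# Invert the table: amino acid -> list of codons that encode it.
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)


def sequence_tags(peptide):
    """Yield every DNA sequence that could encode `peptide` (hypothetical helper)."""
    choices = [CODONS_FOR[aa] for aa in peptide]
    for combo in product(*choices):
        yield "".join(combo)


def reverse_complement(dna):
    """Return the reverse complement, for searching the opposite strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


if __name__ == "__main__":
    tags = list(sequence_tags("MWK"))
    print(len(tags), tags)
```

The number of tags grows as the product of the codon degeneracies, so a practical implementation would likely prune by codon usage or search with a degenerate pattern rather than enumerate every tag.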

[0065] In other embodiments, the subject method is carried out by pattern searching with the amino acid sequence for the sample protein, against a set (e.g., six) of genomic sequence databases representing the genomic nucleotide sequence having been dynamically translated in all three reading frames. That is, the pattern matching is done at the level of actual amino acid sequence in a database of predicted amino acid sequences. As above, the subject method will preferably utilize a pattern matching algorithm which accounts for gaps in the similarity between the amino acid sequence of the sample protein and the dynamically translated genomic sequence in order to allow for intronic sequences which have been carried into the dynamically translated database in the form of non-sense amino acid sequence. For purposes of clarity, the application will now describe the subject method in terms of the use of sequence tags and genomic nucleotide sequence databases, though it will be understood that comparison at the level of amino acid sequence and dynamically translated genomic databases can be readily adapted for the various embodiments to be described.
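The translated-database alternative can be sketched in the same way. The Python example below translates a genomic string in three frames on each strand and reports exact occurrences of a peptide; exact matching stands in for the gap-tolerant matching described above, and the function name `six_frame_hits` and the returned tuple layout are assumptions made for illustration.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_hits(genomic, peptide):
    """Return (strand, frame, nucleotide offset on that strand) for each exact hit."""
    hits = []
    for strand, seq in (("+", genomic), ("-", reverse_complement(genomic))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                hits.append((strand, frame, frame + 3 * pos))
                pos = protein.find(peptide, pos + 1)
    return hits


if __name__ == "__main__":
    print(six_frame_hits("CCATGTGGAAAT", "MWK"))
```

Translating on the fly avoids storing six protein databases, at the cost of repeating the translation for each query.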

[0066] In certain preferred embodiments, the subject method also utilizes homology searching to identify known, related proteins. Where only fragments of the sample protein have been sequenced, the sequence of identified homologs can be used to predict the remaining coding sequence and, accordingly, the intronic structure of the gene. The presence of homologs of known function can, of course, also provide guidance to the potential function of the sample protein.

[0067] The size of the human genome is approximately 25 times that of A. thaliana but the coding sequence is expected to be only 2-3 times larger. Tryptic peptides of the size typically encountered in MS sequencing (>10 aa) are almost always unique in the human genome. The information content of peptide sequence tags approximates that of the complete peptide sequence. In addition, the sequences retrieved by the search are checked against the tandem MS data which eliminates false positives. Therefore, searches using even short tags almost always result in unique identifications. Interestingly, the search specificity in the human genome is virtually identical to that of the dbEST but with the added advantage of high sequence accuracy, low redundancy and unbiased coverage.

[0068] The following example illustrates the situation where peptide sequences correspond to coding regions within a gene. Referring to FIG. 1, whenever a peptide sequence tag derived from a MS/MS spectrum unambiguously identifies the corresponding DNA sequence in the genome, this sequence must be part of an exon. The peptide therefore locates the exon as well as the correct reading frame. In-frame stop codons upstream and downstream of the identified peptide also limit the extent of the exon within which the splice signals (exon intron boundaries) must be found. As described herein, mass spectral data can be used to screen the vicinity of mapped regions for further exons. In many cases, peptides span two exons which enables the localization of the exact splice site for the two exons involved.
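The bounding role of in-frame stop codons can be illustrated with a short Python sketch. It assumes the peptide hit lies on the forward strand and that `start` and `end` delimit whole codons; the name `orf_bounds` is hypothetical, and locating the actual splice signals inside the returned window is not attempted.

```python
STOPS = {"TAA", "TAG", "TGA"}


def orf_bounds(genomic, start, end):
    """Widen genomic[start:end] codon by codon until a stop codon or the
    sequence edge is reached on either side.

    Returns (lo, hi) such that genomic[lo:hi] is the largest in-frame
    window around the hit that contains no stop codon.
    """
    if (end - start) % 3 != 0:
        raise ValueError("hit must span whole codons")
    lo = start
    while lo - 3 >= 0 and genomic[lo - 3:lo] not in STOPS:
        lo -= 3
    hi = end
    while hi + 3 <= len(genomic) and genomic[hi:hi + 3] not in STOPS:
        hi += 3
    return lo, hi


if __name__ == "__main__":
    seq = "TAAGCTATGTGGGCTTGACC"
    lo, hi = orf_bounds(seq, 6, 12)   # hit covers ATGTGG
    print(lo, hi, seq[lo:hi])
```

Any exon containing the peptide must fall inside the returned window, which narrows the region in which splice sites need to be sought.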

[0069] Typically, several peptides are partially sequenced during the course of a protein identification experiment using, for example, a mass spectrometer. Subsequent database searches identify peptides which cluster in a confined (2-15 kb) region of the genome which encompasses the underlying gene. The identified peptides define reading frames which in turn hold information about the intron/exon structure of the gene. Generally, two peptides are sufficient to identify and map the respective gene to its chromosomal location. Any of the identified exons can be used as probes for cloning or for homology searching for tentative function assignment. The defined genome area can be used to direct sequencing of further peptides in the same experiment.

[0070] Most strategies for large scale protein identification follow a two tier analytical approach in which first a MALDI peptide mass fingerprint is created and samples that are not identified in this round of analysis are subjected to partial sequencing by tandem MS. It should be stressed that, owing to the complex structure of genes in higher organisms, fingerprint data alone does not hold sufficient discriminating power to identify proteins directly in a genome. As shown in FIG. 2, however, once part of the coding sequence of a protein has been found in the genome by peptide sequence tags, the 2-15 kb genomic sequence can be searched with the fingerprinting data by translating the nucleotide sequence in the three respective reading frames. Thereby, exon sequence coverage can be extended and additional exons can be found.
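A rough Python sketch of the fingerprint step follows. It takes already-translated reading frames, digests them in silico with the usual trypsin rule (cleave after K or R unless followed by P), and compares monoisotopic peptide masses to observed masses. The mass tolerance, the assumption that observed values are neutral monoisotopic masses without modifications, and the names `tryptic_peptides`, `peptide_mass` and `match_fingerprint` are all illustrative choices rather than part of the described system.

```python
import re

# Monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(protein):
    """Cleave after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def match_fingerprint(frames, observed, tol=0.05):
    """Return (frame index, peptide, mass) for peptides within `tol` Da
    of any observed mass. `frames` holds translated strings in which
    '*' marks a stop codon."""
    hits = []
    for f, protein in enumerate(frames):
        for segment in protein.split("*"):
            for pep in tryptic_peptides(segment):
                if any(aa not in MONO for aa in pep):
                    continue  # skip ambiguous residues such as 'X'
                mass = peptide_mass(pep)
                if any(abs(mass - obs) <= tol for obs in observed):
                    hits.append((f, pep, round(mass, 4)))
    return hits


if __name__ == "__main__":
    print(match_fingerprint(["AKG*GR"], [231.133]))
```

Because the genomic window is only 2-15 kb, an exhaustive digest of all frames is cheap, and matches that fall between already-identified exons are candidates for additional exons.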

[0071] Computational gene prediction in genomic DNA of higher organisms has traditionally been very difficult but a combination of MS data and exon prediction can be very effective at defining gene structure. The genomic region identified in the previous figure was analyzed with GENSCAN and GRAIL and compared to the known sequence of this protein. GENSCAN missed one exon and predicted a surplus one whereas GRAIL predicted two splice sites incorrectly. As shown in FIG. 3, MS data, in conjunction with the genome sequence, rectified the incorrect splice sites, led to inclusion of the exon that was missed and showed that the surplus exon was not present. The extent to which a predicted gene model can be verified or refined by the MS data obviously depends on the number of exons actually identified by peptide sequence tags.

[0072] II. Definitions

[0073] The term “percent identity” refers to the degree to which residues at aligned positions between nucleic acid or amino acid sequences are identical. For example, if two sequences have 43 residues out of a total of 144 in common, they are 29.9% identical.
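As a small worked illustration of this definition, the Python sketch below counts identical residues at aligned positions. It assumes the two sequences are already aligned to equal length with '-' marking gaps, and it divides by the alignment length; other denominators are in use, so this is one convention among several.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned positions holding the same residue in both sequences."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # 43 identical positions out of 144, as in the definition above.
    a = "A" * 144
    b = "A" * 43 + "C" * 101
    print(round(percent_identity(a, b), 1))
```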

[0074] The term “genomic information” includes protein coding regions, introns and other non-coding sequences, and other such structures that commonly appear in genomic sequences. It is also meant to include the reading frame for proteins as encoded by a gene.

[0075] A “nucleotide residue” refers to the nucleotide found along a polynucleotide sequence. For example, in a DNA sequence, it is meant to refer to adenine (A); guanine (G); cytosine (C); and thymine (T). For example, in an RNA sequence, it is meant to refer to adenine (A); guanine (G); cytosine (C); and uracil (U). This term can also include mutated and/or genetically engineered variations of nucleotide bases as are known in the art.

[0076] “ORF” or “Open Reading Frame” is a nucleotide sequence which could be translated into a polypeptide. Such a stretch of sequence is uninterrupted by a stop codon. An ORF that represents the coding sequence for a full protein begins with an ATG “start” codon and terminates with one of the three “stop” codons. For the purposes of this application, an ORF may be any part of a coding sequence, with or without start and/or stop codons. “ORF” and “CDS” may be used interchangeably.

[0077] The term “annotation” refers to the description of an ORF, introns and other genomic features.

[0078] A “contig” is a sequence derived by assembling two or more overlapping sequence fragments. For instance, a contig representing a portion of a CDS may be constructed by combining two or more overlapping EST sequences.

[0079] The term “allele” refers to alternative forms of a genetic locus; a single allele for each locus is inherited separately from each parent. The sequences of two alleles may be identical or may be different.

[0080] III. Methods for Pattern Matching

[0081] There are a variety of “pattern matching” or “approximate string matching” algorithms known in the art which can be readily adapted for use in the present invention. One problem in finding the coding sequence for a protein in genomic sequence databases can be formally stated as follows: given a genomic sequence of length n, a sequence tag of length m, and a maximal number of errors (e.g., gaps for intronic sequence) of k, find all segments of the genomic sequence (referred to herein as “occurrences” or “matches”) whose “edit distance” to the sequence tag is at most k. The edit distance between two sequences is defined as the minimum number of edit operations needed to transform one sequence into the other. The allowed edits in the context of the present invention include deleting, inserting and replacing nucleotide residues. The genomic sequence(s) and sequence tag(s) are sequences of characters from an alphabet Σ (of nucleotide residues) of size σ. The error ratio, or error level, α can be given by α=k/m.
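The problem statement can be made concrete with a plain dynamic-programming search, sketched below in Python. This is the classical column-wise formulation (in the style of Sellers) rather than the bit-parallel method preferred later in the text, and the name `approx_matches` is an assumption. It reports every 1-based end position j at which some segment of the text ending at j lies within edit distance k of the pattern.

```python
def approx_matches(text, pattern, k):
    """Return 1-based end positions of segments of `text` whose edit
    distance to `pattern` is at most k (insert, delete, replace)."""
    m = len(pattern)
    # col[i] = best edit distance between pattern[:i] and any segment of
    # the text ending at the current position. Before any text is read,
    # pattern[:i] can only be matched by deleting its i characters.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]          # value for (i-1, j-1)
        # col[0] stays 0: a match may start at any text position.
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(prev_diag + cost,   # match or replace
                      col[i] + 1,         # extra text character
                      col[i - 1] + 1)     # pattern character missing
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(approx_matches("ACGTACGT", "GTAC", 0))
    print(approx_matches("ACGTACGT", "GTAC", 1))
```

This runs in time proportional to n times m, which is the baseline that the bit-parallel methods described below improve upon for short patterns. With k=1 and m=4 in the usage above, the error ratio is α=0.25.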

[0082] For instance, similarity tools developed by Needleman & Wunsch (J. Mol. Biol. 48:444-453, 1970) and Sellers (SIAM J Appl Math. 26:787-793, 1974) can be used to calculate a global similarity score between the entire lengths of the sequences being compared. This type of algorithm is not sensitive for highly diverged sequences, but does not need to be so in most embodiments of the present method. Another available method focuses on shorter regions of local similarity. Examples of local similarity algorithms include the Smith-Waterman (J Mol Biol 147:195-197, 1981), BLAST (Altschul et al, J Mol Biol 215:403-410, 1990), and FASTA (Pearson and Lipman, PNAS 85:2444-2448, 1988).

[0083] In certain embodiments, the subject method uses a string matching method based on bit operations or on arithmetic, rather than character comparisons. Some of the examples are the Shift-And method, Karp-Rabin fingerprint method, or the algorithm of Commentz-Walter (“A string matching algorithm fast on the average” Proc. 6th International Colloquium on Automata, Languages, and Programming (1979), pp. 118-132), which combines the Boyer-Moore technique with the Aho algorithm.

[0084] In preferred embodiments, the subject method utilizes a pattern matching algorithm from the GREP family. One method for solving this problem is the algorithm described by Aho et al. (“Efficient string matching”, Communications of the ACM 18 (June 1975), pp. 333-340) which solves the problem in linear time. This algorithm is the basis of fgrep. As described in further detail, an exemplary embodiment of the method utilizes the AGREP algorithm, e.g., adapted from the teachings of Wu et al. (1992) Communications of the ACM, 35:83 and Wu et al., Proceedings of the Winter 1992 USENIX Conference, San Francisco, 20-24 January 1992, pp. 153-162, Berkeley.

[0085] The AGREP algorithm is generally useful for problems where one is searching for a pattern P=p₁ p₂ . . . p_(m) inside a large text file T=t₁ t₂ . . . t_(n). The pattern and the text are sequences of characters from a finite character set Σ. In certain embodiments of the subject method, the pattern and the text are DNA sequences, i.e., strings of characters representing nucleotide bases, and the text is preferably a genomic sequence. The AGREP method is used to find all occurrences of the reverse transcribed sequence (a “sequence tag”) P in genomic sequence T; namely, it is used to search for the set of starting positions F={i|1≦i≦n−m+1 such that t_(i) t_(i+1) . . . t_(i+m−1)=P}.

[0086] (A) Exact Match

[0087] In certain embodiments, the subject method uses an exact string matching method. To illustrate, let R be a bit array of size m (the size of the pattern). The term R_(j) denotes the value of the array R after the j-th character of the nucleotide sequence has been processed. The array R_(j) contains information about all matches of prefixes of P that end at j. More precisely, R_(j)[i]=1 if the first i characters of the pattern match exactly the last i characters up to j in the sequence (i.e., p₁p₂ . . . p_(i)=t_(j−i+1)t_(j−i+2) . . . t_(j)). For each character t_(j+1) read, the method determines whether t_(j+1) can extend any of the partial matches so far. For each i such that R_(j)[i]=1, the system checks whether t_(j+1) is equal to p_(i+1). If R_(j)[i]=0 then there is no match up to i and there cannot be a match up to i+1. If t_(j+1)=p₁ then R_(j+1)[1]=1. If R_(j+1)[m]=1, then there is a complete match, starting at j−m+2, and the match is output, e.g., to a file, screen and/or hardcopy. The transition from R_(j) to R_(j+1) can be summarized as follows:

[0088] Initially, R₀[i]=0 for all i, 1≦i≦m; R₀[0]=1 (to avoid having a special case for i=1);

$$R_{j+1}[i] = \begin{cases} 1 & \text{if } R_{j}[i-1] = 1 \text{ and } p_{i} = t_{j+1} \\ 0 & \text{otherwise} \end{cases}$$

[0089] If R_(j+1)[m]=1, then output a match at j−m+2.

[0090] The main observation about this transition, due to Baeza-Yates and Gonnet (in Proceedings of the 12th Annual ACM-SIGIR conference on Information Retrieval, Cambridge, Mass. (June 1989), pp. 168-175), is that it can be computed very fast in practice as follows. Let the “alphabet” be Σ=s₁, s₂, . . . , s_(|Σ|). For each character s_(i) in the alphabet, one constructs a bit array S_(i) of size m such that S_(i)[r]=1 if p_(r)=s_(i). It is sufficient to construct the S arrays only for the characters that appear in the pattern. In other words, S_(i) denotes the indices in the pattern that contain s_(i). It is easy to verify now that the transition from R_(j) to R_(j+1) amounts to no more than a right shift of R_(j) and an AND operation with S_(i), where s_(i)=t_(j+1). So, each transition can be executed with only two simple arithmetic operations, a shift and an AND. Assume that the right shift fills the first position with a 1. If only 0-filled shifts are available (as is the case with C), then the system can add one more OR operation with a mask that has one bit. (Baeza-Yates and Gonnet, supra, used 0 to indicate a match and an OR operation instead of an AND; that way, 0-filled shifts are sufficient.)
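The shift and AND transition can be sketched in a few lines of Python, using an arbitrary-precision integer as the bit array R. In this sketch bit i−1 of the integer (counting from the least significant bit) holds R[i], so the “right shift” of the text becomes a left shift of the integer, and the 1 that fills the first position is supplied by an explicit OR. The function name `shift_and` is an assumption.

```python
def shift_and(text, pattern):
    """Return the 1-based start positions of exact occurrences of
    `pattern` in `text` using the Shift-And bit-parallel method."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    m = len(pattern)
    # S[c] has bit r set when pattern[r] == c (r is 0-based).
    S = {}
    for r, ch in enumerate(pattern):
        S[ch] = S.get(ch, 0) | (1 << r)
    accept = 1 << (m - 1)       # bit that represents R[m]
    R = 0
    starts = []
    for j, ch in enumerate(text):          # ch is t_(j+1) in 1-based terms
        # Shift, fill position 1 with a 1, then AND with S for this character.
        R = ((R << 1) | 1) & S.get(ch, 0)
        if R & accept:
            starts.append(j - m + 2)       # same j-m+2 as in the text
    return starts


if __name__ == "__main__":
    print(shift_and("ACGTACGT", "GTAC"))
```

For a nucleotide alphabet only four S masks are needed, and each text character costs one shift, one OR and one AND regardless of the pattern length, provided the pattern fits in a machine word in a lower-level implementation.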

[0091] (B) Matching With Errors

[0092] In more preferred embodiments, however, the subject method utilizes an algorithm which tolerates errors (mismatches), e.g., for approximate pattern matching between the sequence tag and genomic sequence(s). In one embodiment, the previously described method can be adapted to allow errors in matching. As a simple illustration, the method can be adapted to permit one insertion into the pattern at any position. In other words, the method finds all intervals of size at most m+1 in the genomic sequence that contain the pattern of the sequence tag as a subsequence. The R and S arrays are defined as before, but now there are two possibilities for each prefix match. There can be an exact match or a match with one insertion. Accordingly, another array is introduced, denoted by R_(j)¹, which indicates all possible matches up to t_(j) with at most one insertion. More precisely, R_(j)¹[i]=1 if the first i nucleotides of the pattern match i of the last i+1 characters up to j in the sequence. If both R and R¹ are maintained, then all matches can be found which have at most one insertion: R_(j)[m]=1 indicates that there is an exact match and R_(j)¹[m]=1 indicates that there is a match with at most one insertion (sometimes both will equal 1 at the same time).

[0093] The transition for the R array is the same as before. One need only specify the transition for R¹. There are two cases for a match with at most one insertion of the first i characters of P up to t_(j+1):

[0094] I1. There is an exact match of the first i nucleotides up to t_(j). In this case, inserting t_(j+1) at the end of the exact match creates a match with one insertion.

[0095] I2. There is a match of the first i−1 nucleotides up to t_(j) with one insertion and t_(j+1)=p_(i). In this case, the insertion is somewhere inside the sequence and not at the end.

[0096] Case I1 can be handled by just copying the value of R to R¹, and case I2 can be handled with a right shift of R¹ and an AND operation with S_(i) such that s_(i)=t_(j+1). So, to compute R_(j+1)¹, one additional shift (the shift of R is done already), one AND operation and one OR operation are needed.

[0097] In another embodiment, the method can allow for one deletion between the sequences (and no insertions). R, R¹ (which now indicates one deletion), and S are as defined before. There are again two cases for a match with at most one deletion of the first i characters of P up to t_(j+1):

[0098] D1. There is an exact match of the first i−1 characters up to t_(j+1) (which is indicated by the new value of the R array, R_(j+1)[i−1]). This case corresponds to deleting p_(i) and matching the first i−1 characters.

[0099] D2. There is a match of the first i−1 characters up to t_(j) with one deletion and t_(j+1)=p_(i). In this case, the deletion is somewhere inside the sequence and not at the end. Case D2 is handled as before (it is exactly the same), and case D1 is handled by a right shift of the new value of R_(j+1).

[0100] In still another embodiment, the method can allow for a substitution. That is, it allows for replacing one nucleotide of P with one nucleotide of T. Again, there are two cases:

[0101] S1. There is an exact match of the first i−1 nucleotides up to t_(j). This case corresponds to substituting t_(j+1) with p_(i) (whether or not they are equal; the equality will be indicated in R) and matching the first i−1 nucleotides.

[0102] S2. There is a match of the first i−1 nucleotides up to t_(j) with one substitution and t_(j+1)=p_(i). In this case, the substitution is somewhere inside the sequence and not at the end.

[0103] Case S2 is again the same. Case S1 corresponds to looking at R_(j)[i−1] as opposed to looking at R_(j+1)[i−1] in case D1.

[0104] However, in certain preferred embodiments, the subject method handles the general case of up to k errors, where an error can be either an insertion, a deletion, or a substitution (the Levenshtein or edit-distance measure). Overall, instead of one additional R¹ array, k additional arrays R¹, R², . . . , R^(k) are maintained, such that array R^(d) stores all possible matches with up to d errors. The transition from array R_(j) ^(d) to R_(j+1) ^(d) is determined as follows. There are 4 possibilities for obtaining a match of the first i nucleotides with ≦d errors up to t_(j+1):

[0105] 1. There is a match of the first i−1 nucleotides with ≦d errors up to t_(j) and t_(j+1)=p_(i). This case corresponds to matching t_(j+1).

[0106] 2. There is a match of the first i−1 nucleotides with ≦d−1 errors up to t_(j). This case corresponds to substituting t_(j+1).

[0107] 3. There is a match of the first i−1 nucleotides with ≦d−1 errors up to t_(j+1). This case corresponds to deleting p_(i).

[0108] 4. There is a match of the first i nucleotides with ≦d−1 errors up to t_(j). This case corresponds to inserting t_(j+1).

[0109] R is denoted as R⁰, and the method assumes that t_(j+1)=s_(c). The subject method can provide the following expression for R_(j+1) ^(d):

[0110] R₀ ^(d)=11 . . . 100 . . . 000, with d ones. $\begin{aligned} R_{j+1}^{d} &= \mathrm{Rshift}[R_{j}^{d}]\ \mathrm{AND}\ S_{c}\ \ \mathrm{OR}\ \ \mathrm{Rshift}[R_{j}^{d-1}]\ \ \mathrm{OR}\ \ \mathrm{Rshift}[R_{j+1}^{d-1}]\ \ \mathrm{OR}\ \ R_{j}^{d-1} \\ &= \mathrm{Rshift}[R_{j}^{d}]\ \mathrm{AND}\ S_{c}\ \ \mathrm{OR}\ \ \mathrm{Rshift}[R_{j}^{d-1}\ \mathrm{OR}\ R_{j+1}^{d-1}]\ \ \mathrm{OR}\ \ R_{j}^{d-1}. \end{aligned}$

[0111] Overall, there are a total of two shifts, one AND, and three ORs for each R^(d). There are k+1 arrays, so the total amount of work is O((k+1)n). An important feature of this algorithm is that it can be relatively easily extended to several more complicated patterns, including accounting for intronic sequences present in the genomic sequence.
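The k-error recurrence described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the subject method: the function name `search_with_errors` is hypothetical, Python integers stand in for the bit arrays, and bit i−1 of each integer plays the role of array position i, so the "right shift" of the text becomes a left shift that feeds a 1 into the first position.

```python
def search_with_errors(pattern, text, k):
    """Return 0-based end positions in text where pattern matches with <= k errors.

    Illustrative sketch only; names are hypothetical. Bit i-1 of each integer
    corresponds to array position i in the description.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    # S[c]: bit i is 1 iff pattern[i] == c (the S arrays of the description)
    S = {}
    for i, c in enumerate(pattern):
        S[c] = S.get(c, 0) | (1 << i)
    mask = (1 << m) - 1          # keep only m bits
    accept = 1 << (m - 1)        # bit for "whole pattern matched"
    # R[d] starts with its first d bits set (d leading pattern characters deleted)
    R = [(1 << d) - 1 for d in range(k + 1)]
    hits = []
    for j, c in enumerate(text):
        sc = S.get(c, 0)
        prev_old = R[0]                      # R_j^(d-1), value before this step
        R[0] = ((R[0] << 1) | 1) & sc        # exact-match transition
        for d in range(1, k + 1):
            old = R[d]                       # R_j^(d)
            R[d] = (
                (((old << 1) | 1) & sc)      # case 1: match t_(j+1)
                | ((prev_old << 1) | 1)      # case 2: substitute t_(j+1)
                | ((R[d - 1] << 1) | 1)      # case 3: delete p_(i), uses new R^(d-1)
                | prev_old                   # case 4: insert t_(j+1)
            ) & mask
            prev_old = old
        if R[k] & accept:
            hits.append(j)
    return hits
```

For example, `search_with_errors("ACGT", "TTACGTTT", 0)` reports the single exact occurrence ending at index 5, while with k=1 the same pattern is also found in texts carrying one substituted, inserted or deleted nucleotide.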

[0112] If the number of errors is small compared to the size of the pattern, then the running time can be improved in some instances by what is referred to as the partition approach. Suppose again that the pattern P is of size m and that at most k errors are allowed. Let $r = \left\lfloor \frac{m}{k+1} \right\rfloor,$

[0113] and let P₁, P₂, . . . , P_(k+1) be the first k+1 blocks of P, each of size r. In other words, P₁=p₁ p₂ . . . p_(r), . . . , P_(j)=p_((j−1)r+1) . . . p_(jr). If P matches the text with at most k errors, then at least one of the P_(j)'s must match the sequence exactly. All P_(j)'s can be searched at the same time and, if one matches, then the whole pattern can be checked directly within a neighborhood of size m from the position of the match. Since, in this embodiment, the method looks for an exact match, there is no need to maintain all k of the R^(d) vectors. This scheme will run fast if the number of exact matches to any one of the P_(j)'s is not too high. The number of such matches depends on many factors, including the values of r and m.
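To make the partition approach concrete, the following is a minimal sketch under stated assumptions. The names `partition_search` and `min_errors_in_window` are hypothetical; exact block search uses Python's built-in `str.find` rather than a simultaneous multi-pattern scan, and candidate windows are verified with a plain dynamic-programming edit distance rather than the bit-parallel arrays.

```python
def min_errors_in_window(pattern, window):
    """Smallest edit distance between pattern and any substring of window."""
    prev = list(range(len(pattern) + 1))   # empty text: i deletions
    best = prev[-1]
    for c in window:
        cur = [0]                          # a match may start anywhere
        for i, p in enumerate(pattern, 1):
            cur.append(min(prev[i - 1] + (p != c),  # match / substitution
                           prev[i] + 1,             # extra text character
                           cur[i - 1] + 1))         # missing pattern character
        best = min(best, cur[-1])
        prev = cur
    return best


def partition_search(pattern, text, k):
    """Return sorted (start, end) windows of text holding a match with <= k errors."""
    m = len(pattern)
    r = m // (k + 1)
    if r == 0:
        raise ValueError("pattern too short for k errors")
    candidates = set()
    for b in range(k + 1):
        offset = b * r
        block = pattern[offset:offset + r]
        start = text.find(block)
        while start != -1:
            # window around where the whole pattern would sit, padded by k
            lo = max(0, start - offset - k)
            hi = min(len(text), start - offset + m + k)
            candidates.add((lo, hi))
            start = text.find(block, start + 1)
    return sorted(w for w in candidates
                  if min_errors_in_window(pattern, text[w[0]:w[1]]) <= k)
```

With k=1 the pattern "ACGTACGT" is split into two blocks of size 4; in the text "GGGACGTTCGTGGG" one exact block hit yields a window that verifies with a single substitution, while the other candidate window is rejected at the verification step.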

[0114] (C) Multiple Patterns

[0115] In the process of calculating the various potential coding sequences for a given amino acid sequence, the subject method will generate multiple sequence tags. In general, one will want to find all occurrences of any of these sequence tags. Under those circumstances, the pattern searching against the genomic sequence(s) can be conducted one tag at a time or for all tags together.

[0116] The advantage of searching for all of the sequence tags together is that it can be done in one scan (and in one command). Suppose that one is looking for P₁, P₂, . . . , P_(r). All of the sequence tags can be concatenated and put in one array (using as many words as needed), and the algorithm applied to that array with the following modifications. Let M be a bit array the size of the combined sequence tags, and let bit i be 1 if and only if i corresponds to the first character of any of the sequence tags. For each s∈S, two bit arrays are built. The first, S_(s), is identical to the one described above. It is used to determine if a match occurs. The second array is S′_(s)=S_(s) AND M. It indicates whether s is the first character of any pattern. If so, then one must start the match at that pattern: e.g., the method should not depend on the end of the previous pattern. Thus, after computing R_(j), the method performs an OR function with S′_(s) (where s=t_(j)). The rest of the R arrays are computed as before, except that in each step they are OR'd with a special mask that sets the first d bits in R^(d) of each separate pattern to 1; this allows d initial errors in each pattern.

[0117] The multi-pattern matching algorithm described above can be used to solve the approximate string-matching problem for searching reverse translated sequences against genomic sequences. Let P=p1, p2, . . . , pM be a pattern string, and let T=a1, a2, . . . aN be a text string. We partition P into k+1 fragments P1, P2, . . . , Pk+1, each of size m=M/(k+1). Let Tij=ai, . . . , aj be a substring of T. By the pigeonhole principle, if Tij differs from P by no more than k errors, then one of the fragments must match a substring of Tij exactly.

[0118] The approximate string matching algorithm is conducted in two phases. In the first phase, the pattern is partitioned into k+1 fragments and the multi-pattern string matching algorithm is used to find all those places in the genomic sequence that contain one of the fragments. If there is a match of a fragment at position i of the genomic sequence, the system marks the positions i−M−k to i+M+k−m as a “candidate” area. After the first phase is done, an approximate matching algorithm as described above is applied to find the actual matches in those marked areas. In an illustrative embodiment, the pseudo-code for the subject method may be illustrated by:

    Let p be the current position of the text;
    while (p < N) /* N is the end position of the sequence text */
    {
        blk_idx = map(a_(p−b+1) a_(p−b+2) . . . a_(p)); /* map transforms a string of size b into an integer */
        shift_value = SHIFT[blk_idx];
        if (shift_value > 0)
            p = p + shift_value;
        else
        {
            compute the hash value of a_(p−μ+1) . . . a_(p);
            compare a_(p−μ+1) . . . a_(p) to every pattern that has the same hash value;
            if there is a match then report a_(p−μ+1) . . . a_(p);
            p = p + 1;
        }
    }

[0119] IV. Uses in Proteomics

[0120] Mass spectrometry has emerged as a central technique in a wide variety of functional genomics, or proteomics, approaches to study gene function in the post-genomics world. Mass spectrometric instrumentation continues to become more powerful and novel instrumental concepts are being put into use. The subject genomic searching system can be used as part of a proteomics discovery method.

[0121] For instance, the subject method can use peptide sequence information obtained by mass spectrometry as the identification method in “expression proteomics”, e.g., with sequencing data from two-dimensional gels of two different biological states.

[0122] Several interesting approaches have been taken recently towards the analysis of the proteome without the use of gel electrophoresis. In one such approach, the protein population is separated by a variant of capillary electrophoresis and the intact proteins are then eluted into a Fourier transform ion cyclotron resonance mass spectrometer (FT ICR). The FT ICR is capable of storing the ions and measuring them at extremely high resolution and mass accuracy using a frequency based method. Measurement of several hundred protein components from lysates of Escherichia coli or yeast has already been shown. Jensen et al. (1999) Anal. Chem. 71:2076. Using a variant of the tandem mass spectrometric method, it may also be possible to identify the proteins “on-line” as they elute into the mass spectrometer. See, for example, Mørtz et al. (1996) PNAS 93:8264-8267; and Li et al. (1999) Anal. Chem. 71:4397.

[0123] In another approach, crude protein mixtures are digested in solution without separation. The resulting peptide mixture is then analyzed by the LC/MS method outlined above. Yates et al. (1997) Protein Chem. 16:495; and Link et al. (1999) Nat. Biotechnol. 17:676. As the capacity of the mass spectrometer to sequence co-eluting peptides increases, more and more complex protein mixtures can be analyzed.

[0124] (A) Multi-Protein Complexes

[0125] In one embodiment, the subject method is used to search genomic databases for sequences derived from multi-protein complexes, e.g., assemblies with a particular function such as splicing, transport or nuclear import/export. One use of proteomics technology is to determine the make-up of such complexes. To this end, they need to be purified specifically, the identity of the factors in the complex needs to be determined and finally the in vivo presence of the novel members of the complex needs to be established.

[0126] (B) Signaling Pathways

[0127] The subject method can also be used as part of a proteomic discovery method to elucidate transient rather than structural complexes. Many signaling cascades are transmitted through multi-protein complexes involving scaffolds and these complexes can be biochemically purified.

[0128] (C) Organelles

[0129] In still other embodiments, e.g., apart from multi-protein complexes, the subject method can be used to identify proteins in cellular organelles. For instance, organelles can be purified and their composition analyzed by mass spectrometry. Since organelles are often less well defined than smaller multi-protein complexes, the task of verification of identifications becomes even more important.

[0130] V. Business Methods

[0131] Yet another aspect of the present invention relates to a method of conducting a proteomics business, comprising:

[0132] (i) by the above-described method, determining the identity of a target gene encoding a protein isolated on the basis of the protein being (a) involved in an interaction of interest, (b) having a cellular localization of interest, (c) having a differential expression pattern of interest, or (d) being post-translationally modified;

[0133] (ii) identifying agents by their ability to alter the level of expression of the target gene or the activity of an expression product of the target gene;

[0134] (iii) conducting therapeutic profiling of agents identified in step (ii), or further analogs thereof, for efficacy and toxicity in animals; and

[0135] (iv) formulating a pharmaceutical preparation including one or more agents identified in step (iii) as having an acceptable therapeutic profile.

[0136] The subject business method can include the additional step of establishing a distribution system for distributing the pharmaceutical preparation for sale, and may optionally include establishing a sales group for marketing the pharmaceutical preparation.

[0137] Still another aspect of the present invention provides a method of conducting a proteomics business, comprising:

[0138] (i) by the above-described method, determining the identity of a target gene encoding a protein isolated on the basis of the protein being (a) involved in an interaction of interest, (b) having a cellular localization of interest, (c) having a differential expression pattern of interest, or (d) being post-translationally modified;

[0139] (ii) (optionally) conducting therapeutic profiling of the target gene for efficacy and toxicity in animals; and

[0140] (iii) licensing, to a third party, the rights for further drug development of inhibitors or activators of the target gene.

[0141] VI. Exemplary System

[0142] In an illustrative embodiment, there is provided a protein identification program (ID program) comprising two main components: a server application with sequence database search routines that include client interface(s). Merely to illustrate, the ID program can be automated via the Microsoft Access databases ProAutoDB and ProLogDB and associated Visual Basic applications. Control of automation and data flow can be as follows: from the ID program GUI it is specified to query e.g. the ‘FavoriteIndexFile’ from a list of several virtual index files. Elsewhere it is specified that ‘FavoriteIndexFile’ is actually e.g. particular genomic sequence databases. Upon finding matches with scores higher than a predefined value, the search result and all search parameters can be logged, also in another prespecified database, and further searches on the dataset can be aborted or continued as predefined in the automation database.

[0143] Special automated actions can also be triggered by certain database retrieval events, e.g. the matching of a data set to a specific ORF (Open Reading Frame) could result in an e-mail being sent with all available information to a person with particular interest in this gene/protein.

[0144] A few terms used in these examples include:

[0145] “ProteomeDB”: A sample/identification proteomics database for logging and correlating information such as sample identity, gel photos, mass spectra (and features therein), search results, etc. This database can be the final destination of data but can also be regarded as a temporary storage facility for data that is subsequently transferred by, e.g., standard SQL commands to other databases (e.g. Oracle and Sybase databases) on a remote server.

[0146] “ID program Flow Agent”: A software daemon to automate the transfer and utilization of mass spectrometric data.

[0147] “ProAutoDB”: Database(s) containing search parameters and related information regarding data sets that have been scheduled for later automated (and repeated) searching against sequence databases.

[0148] In certain embodiments, all incoming samples are logged into ProteomeDB before any analyses are carried out. Each logged sample is then automatically given a unique ID number that can be used to sort subsequently generated mass spectrometric data and database search results. In certain preferred embodiments, ProteomeDB will be able to download digestion protocols to a robotic workstation and supply all relevant sample information directly into MALDI and ES mass spectrometer control software. This means that setting up the analysis of a batch of samples will be done automatically.

[0149] Referring to FIG. 7, mass spectrometric data is acquired either: i) manually; ii) automatically through built-in features in the MS software; or iii) governed by scripts. The relevant MS information, e.g. a peptide mass list or fragment mass list, is passed to ProAutoDB either by the MS control software directly, or via ID program Flow Agent. ID program Client checks ProAutoDB for new tasks at set intervals and, upon finding a job, executes the sequence database search. The outcome of every search is logged in ProLogDB, and if sequence database entries achieve a scoring value above a set threshold then these proteins are also logged back into ProteomeDB under the pertinent sample record.

[0150] (i) ProteomeDB Files

[0151] In the illustrated example, ProteomeDB is a hierarchic database as can be developed for Microsoft Access. It may contain tables, forms, reports and a Visual Basic module or the like. Briefly, a batch can have many samples, each of which can have many mass spectra, each of which can have many database search results. The form set and the database tables can also be separated (called ‘split’) such that the data can reside on a central file server and be simultaneously accessible to a group of users, each of whom should have a copy of the form set on their computer.

[0152] (ii) The Navigation Switchboards

[0153] FIG. 8 shows an exemplary first window that becomes available after opening ProteomeDB, the main ‘Switchboard’. The appearance of the switchboard can be modified to display the logo and colors of a company. The ‘Enter New Batch’ button can be used to enter the data relating to a new batch of samples. One or more secondary switchboards give access to most of the sub-forms for more direct and simplified entry of data (e.g., going into one table at a time). See FIGS. 9-11.

[0154] (iii) Batch Overview: Information Relating to the Entire Batch

[0155] The primary batch information can be one record. See FIG. 12. These two forms, or views, can be set up to be the most used ProteomeDB interfaces; i.e., they are the ‘top level’ where a batch of samples is set in line for analysis and the report option is finally chosen. Ideally, it is only necessary to type in the number of samples in the batch and the name given to each sample by the owner of the batch. There is no way of predicting what Web call their samples, so this task is preferably not automated. The information in all the sub forms and surrounding bits of information may be either:

[0156] entered automatically at different stages by other applications. Examples of this functionality are spectrum names, peak lists, analysis dates etc.

[0157] reused information from earlier batches. Examples of reuse are digestion protocols (and protocol steps), contact person information, etc.

[0158] (iv) Companies Sub Form

[0159] The Companies sub form shows all the information stored on each single contact person. Not all information is necessarily used for each role that the contact subsequently has. For example, only the person chosen in the ‘Contacts information’ tab may be allocated the Web access code, and only the person in the tab ‘Billing information’ shows as having a Tax identification number associated.

[0160] List of Samples: A list can be provided which contains the information for each batch, namely the sample names along with the corresponding identification or sequence information that was found. When setting up a new batch, the Web can be prompted to start by setting the number of samples in order to obtain an auto-generated list of unique sample ID numbers. The analysis status of a new sample can be by default ‘Received for identification’, with another status possibility chosen manually, such as, for example, ‘Received for sequencing’.

[0161] Prewritten report options: In certain embodiments, the ProteomeDB can maintain reports, e.g., for printing or electronic documentation, in separate text files pertaining to the relevant analytical results from each batch. To illustrate, the system may offer some report options that exclusively deal with the results:

[0162] The ‘Short status report’: a very short information abstract that shows the results from the entire batch on one or a few pages.

[0163] The ‘Search Details paper’: includes results lists from each search and can occupy several pages per sample. In many cases, the researchers doing the actual protein analysis may not be the ones who own (submit) the samples and need the results. This means that communication of results to other parties is needed, and for that purpose there are some extra report options:

[0164] The ‘Receipt of samples’ fax: to confirm the arrival of the samples, also states the batch ID and Web access code that will allow the owner of the samples to follow the analysis progress via the Web.

[0165] The ‘Report letter’: a letter to accompany one or all of the result reports mentioned above. The information that needs to be conveyed about the analyses can often be similar from project to project. Therefore it may be useful to have a selection of informative standard paragraphs that may be included (or excluded) in this letter.

[0166] The ‘Invoice’: This module may generate the finished invoice, or can be used for interdepartmental billing. Invoice numbers can be assigned when the ‘Batch status’ is changed to ‘Completed’.

[0167] Information about the batch: Date fields allow entry of the dates where: I) the samples were dispatched by their owner; II) the samples were received for analysis; and III) the analyses were completed. In certain embodiments, the system will provide queries to check current status at various stages of the project work.

[0168] (v) User Designated Search Parameters

[0169] The windows in FIGS. 13 and 14 contain the primary information that can be used in the database query, e.g., under the mass spectrometric data. In the illustrated interface, the user can create searches using peptide maps, peptide sequence tags, breakpoint, and sequence alone. Help lines pertaining to any parameter field can be provided, shown here in the lower left-hand corner of the window, e.g., by leaving the cursor over the field of interest.

[0170] Nucleotide databases can be queried by peptide maps by the ID program version. ‘Breakpoint’ searches require a defined minimum number of fragment ion masses to match theoretically expected fragment masses. See FIG. 15. For example, for a database entry to match, the system may require that, from a list of 10 masses, at least 5 be possible Y-ions.

[0171] Main parameters are the precursor mass along with a list of fragment masses, of which a requested number must theoretically match calculated fragment masses of Y, B, or Y AND B ions. See FIG. 16. The MS error may, for example, be chosen very large (say, 50 Da) to accommodate modifications, substitutions, etc. To regain search specificity, a very small MS/MS error should then be used (for example, 20 ppm).
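The "requested number of fragment masses within a small MS/MS error" criterion can be sketched as follows. This is a hypothetical illustration only: the function names, the default tolerance and the example mass values are assumptions, and the theoretical Y- or B-ion masses are taken as given inputs rather than calculated from a database sequence.

```python
def within_ppm(observed, theoretical, ppm):
    """True if observed lies within +/- ppm parts per million of theoretical."""
    return abs(observed - theoretical) <= theoretical * ppm * 1e-6


def count_fragment_matches(observed_masses, theoretical_masses, ppm=20.0):
    """Count observed fragment masses that match any theoretical fragment mass."""
    return sum(
        any(within_ppm(o, t, ppm) for t in theoretical_masses)
        for o in observed_masses
    )


def breakpoint_match(observed_masses, theoretical_masses, min_matches, ppm=20.0):
    """Accept a database entry if at least min_matches observed masses fit."""
    return count_fragment_matches(observed_masses, theoretical_masses, ppm) >= min_matches
```

With made-up theoretical masses of 500.0, 600.0 and 700.0 and observed masses of 500.005, 600.1 and 700.0, two of the three observations fall inside a 20 ppm window, so an entry would be accepted for `min_matches=2` and rejected for `min_matches=3`.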

[0172] This search method is useful for searching on completely uninterpreted data. However, the search specificity is not as high as for sequence tag searches.

[0173] (vi) Additional Sample Information Dialog Box

[0174] FIG. 17 shows a sample dialog box for entering and viewing information that is secondary to the database searches, i.e. it is unnecessary for the search itself but may be relevant to the information flow following completion of the search. All of this information is expected to be entered automatically, either by the ID program Client itself or by the ID program Flow Agent passing information to ID program. In the illustrated case, search information can then be logged in ProteomeDB (if logging is selected) but ONLY if a unique sample record can be assigned. This requires the field ‘Sample ID’ to be filled out correctly. There must also be a spectrum name. Other fields can remain empty and still allow logging.

[0175] (vii) Automating ID Program

[0176] FIG. 18 shows a search parameter window for automating pattern matching. The ‘Search life cycle’ can be set in the Automation tab of the search window. Parameters that may be required by the subject system include:

[0177] 1. When a new search should be run if the initial search fails; e.g. a certain number of days, next database update, etc.

[0178] 2. On which computer the search should run.

[0179] 3. The definition of a failed search. The system may be instructed to continue searching until the score of the best match is more than this value and the score of the second best match (if any) is less than a percentage of the best match. This means that the search will be scheduled if the score of the best match was not high enough. A score of 1 means that no searches will be scheduled. A percentage of 99 means that the score of the second match will be ignored.

[0180] 4. How often the search should be run. For example, the user could specify to run the search once, or each day/week for a number of weeks.

[0181] In certain embodiments, the ID program Client can be configured to send an e-mail to a user when a match is found or when the search life cycle has ended and no match has been found. For instance, the ID program Client can use the Simple Mail Transfer Protocol to send an e-mail.

[0182] (viii) Logging Searches

[0183] FIG. 19 shows one embodiment in which the options for logging search results can be specified by the user. These log files can be, e.g., local or on a remote file server. If the ProLogDB file does not exist, ID program Client will create such a file.

[0184] The ProteomeDB database file can be the file that contains the tables. This means that if a database is split into forms and tables (e.g., by Microsoft Access function) then ID program Client must also keep track of the various parts.

[0185] (ix) The Multi Template Interface

[0186] In certain embodiments, it will be desirable to automatically apply other search parameters if the search failed using the standard parameters. See FIG. 20.

[0187] In such embodiments, the user may be prompted to set the standard search parameters to the most accurate values. If the standard search fails then the user may define one or more follow-up searches using, e.g., less stringent criteria.

[0188] (x) The Search Results Window

[0189] Matching entries to a query can be returned in a dynamic table that allows alphabetic sorting in either ascending or descending order following any column contents. Default sorting may follow a score, which follows empirical scoring algorithms based on observations from hundreds of database searches. For example, these may have the form:

[0190] Peptide Map: $\mathrm{Score} = K_s \cdot \frac{\left( \sum_{n=1}^{N_m} \left( 1 + K_e \cdot \left| \Delta M_n \right|^{K_\Delta} \right)^{-1} \right)^{2}}{N_u \cdot \sqrt{M_{prot}}},$

[0191] where Ks=1100; Nu=the number of peptide masses entered; Nm=the number of matching masses; Ke=1.0; ΔMn=the absolute mass error; KΔ=1.0; and Mprot=protein Mw in kDa.
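As a worked illustration of the peptide map score, the formula can be evaluated as in the sketch below. The function and parameter names are hypothetical, the constants are the example values given above, and the unit of the mass error is whatever unit the search uses, since it is not specified here.

```python
import math


def peptide_map_score(mass_errors, n_entered, protein_mw_kda,
                      ks=1100.0, ke=1.0, k_delta=1.0):
    """Evaluate the empirical peptide map score.

    mass_errors    -- mass error of each matching peptide (one entry per match)
    n_entered      -- number of peptide masses entered (Nu)
    protein_mw_kda -- protein molecular weight in kDa (Mprot)
    """
    total = sum(1.0 / (1.0 + ke * abs(dm) ** k_delta) for dm in mass_errors)
    return ks * total ** 2 / (n_entered * math.sqrt(protein_mw_kda))
```

For instance, four masses entered that all match a 100 kDa protein with zero error give a sum of 4, hence 1100·16/(4·10) = 440. Larger mass errors shrink each term of the sum, and a larger protein lowers the score through the square root in the denominator.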

[0192] Sequence Tag: $\mathrm{Score} = \begin{cases} S2 & \text{if } S2 \geq 0 \\ 0 & \text{if } S2 \leq 0 \end{cases},$

[0193] where S2=S1+Sn+Sc, where Sn and Sc are 0 if N- and C-terminal specificity are false and 100.00 if they are true, respectively.

[0194] $S1 = K_{max} - \sqrt{K_e \cdot \Delta M_{pep}}$, where Kmax=500; Ke=100; and ΔMpep=the absolute mass error.
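The sequence tag score can likewise be written out as a short sketch. The function and argument names are hypothetical, and the constants are the example values stated above.

```python
import math


def sequence_tag_score(mass_error, n_term_specific, c_term_specific,
                       k_max=500.0, ke=100.0):
    """Evaluate the empirical sequence tag score.

    mass_error      -- absolute peptide mass error (ΔMpep), assumed non-negative
    n_term_specific -- True if N-terminal cleavage specificity is satisfied
    c_term_specific -- True if C-terminal cleavage specificity is satisfied
    """
    s1 = k_max - math.sqrt(ke * abs(mass_error))
    sn = 100.0 if n_term_specific else 0.0
    sc = 100.0 if c_term_specific else 0.0
    s2 = s1 + sn + sc
    return s2 if s2 >= 0 else 0.0   # negative values are clamped to zero
```

A perfect mass match with both termini specific scores 500 + 100 + 100 = 700, and a mass error of 1.0 with no terminal specificity scores 500 − 10 = 490.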

[0195] FIG. 21 shows an illustrative Search Result Window. Selecting and then ‘right-clicking’ on an entry in the result window, for example, can bring up a menu with a selection of information windows to further enhance the analysis of that entry.

[0196] FIG. 22 shows an exemplary 2nd pass check window. To obtain a 2nd pass check of a given matching database entry, the entry can be selected, e.g., by left-clicking in any field of its row in the Result window, and the ‘2nd pass check’ window then chosen either by right-clicking or from the side bar. The window displays the entire sequence information in the index file with the matched sequence pieces highlighted in different colors. Sequence covered by one matching peptide is a first color, that covered by two peptides is another color, and so forth.

[0197] (xi) Database Entry Window

[0198] FIG. 23 shows an exemplary database entry window. The illustrated browser window is for fast access to e.g. SwissProt and BLAST searches at NCBI. The addresses of the databases are listed in a settings file and can be changed to utilize Intranet mirrors instead of the presently chosen sites.

[0199] (xii) Result Summary List Window

[0200] FIG. 24 illustrates what a Result summary window may look like. There is a small check box in the upper right-hand corner of each search result window. When checked, the contents of the individual search result windows are passed to the summary window. Here the result lists are interleaved so that proteins found multiple times register with a counted occurrence and added score values. This is meant to provide a simple and fast means of comparing data from several individually non-specific searches (as arising from short sequence tags and low abundance MALDI maps, for example).

[0201] To identify which search(es) a particular entry was found in, the user can select the entry and choose ‘Find result entry’. This will bring the pertinent single result windows to the foreground while highlighting the entry of interest in each list.

[0202] (xiii) The Database Browser Window

[0203] FIG. 25 is an illustration of a database browser window. In this example, the ProLogDB file whose path is specified in the ‘Logging’ tab can be browsed directly from the ID program Client by choosing ‘ProLogDB’ in the ‘View’ menu. The result list can also be shown in full length in a separate window (not displayed here) whenever a record is selected (highlighted). Alternatively, it is possible to work with the contents of these files via Microsoft Access or the like.

[0204] (xiv) Miscellaneous Features

[0205] FIG. 26 shows two sample windows for permitting the user to convert a nucleic acid sequence to the corresponding amino acid sequence.

[0206] FIG. 27 shows an exemplary search parameter window for calculating theoretical fragment masses from peptide sequences. Selecting an entry from any result window from searches other than on peptide maps will enable the calculation of theoretical fragment mass values. These can be sorted (ascending and descending), for example, by each of the column titles shown in the window below.

[0207] FIG. 28 is a window which can be used as an interface for translating DNA sequence data that is either typed or copied into the window. It is also possible to type in a stretch of amino acid sequence to check for the occurrence of the sequence in each reading frame. This feature can be used for the validation of found ESTs on queries by MS/MS data. However, the window can also be used generally for highlighting amino acid sequence stretches in longer sequences of amino acids copied in (disregarding the use of the translation facility).

[0208] (xv) Overview of an ID Program Flow Agent

[0209] The ID program Flow Agent can function as a conduit for control and information between the mass spectrum acquisition software and ID program by transferring a list of peptide masses from an MS peptide map to ID program Client for subsequent database search. The ID program Flow Agent can monitor specified folders for the arrival of new peak lists and then transfer these, with or without relevant specific search parameters, to ID program. This application is generally useful on computer systems that are directly in control of mass spectrometric data acquisition and handling or wherever the mass spectra are stored, but it also works well over a network.

[0210] VII. Exemplification

Example 1

[0211] Proteome projects seek to provide systematic functional analysis of the genes uncovered by genome sequencing initiatives. Mass spectrometric protein identification is a key requirement in these studies but, to date, database-searching tools rely on the availability of protein sequences derived from full-length cDNA, expressed sequence tags (ESTs) or predicted open reading frames (ORFs) from genomic sequences. We demonstrate here that proteins can be identified directly in large genomic databases using peptide sequence tags obtained by tandem mass spectrometry. On the background of vast amounts of non-coding DNA sequence, identified peptides localize coding sequences (exons) in a confined region of the genome, which contains the cognate gene. The approach does not require prior information about putative ORFs as predicted by computerized gene finding algorithms. The method scales to the complete human genome and allows identification, mapping, cloning and assistance in gene prediction of any protein for which minimal mass spectrometric information can be obtained. Several novel proteins from A. thaliana and human have been discovered in this way.

[0212] A. Materials and Methods

[0213] Proteins: Protein samples from Arabidopsis A. thaliana wereexcised from a 2D PAGE gel of a total membrane-associated proteinpreparation. (Human protein samples were obtained from ongoing researchprojects within our group and through collaborations, see Example 2).Spots were excised from gels and digested with trypsin as describedpreviously. See Shevchenko et al. 1996 Anal Chem 68:850-858.

[0214] Mass spectrometry: MALDI mass spectra were acquired on a Bruker REFLEX III reflectron time of flight (TOF) mass spectrometer (Bruker-Daltonik, Bremen, Germany). Matrix surfaces were made from α-cyano-4-hydroxycinnamic acid by the fast evaporation method. See Vorm et al. 1994 Anal. Chem. 66:3281-3287; and Jensen et al. 1996 Rapid Commun Mass Spectrom 10:1371-1378.

[0215] About 1-2% (0.3-0.5 μl) of the supernatant of in-gel trypsin digests was injected into an acidified drop previously deposited onto the matrix surface. The monoisotopic masses for all peptide ion signals in the acquired spectra were determined and used for database searching. Peptides were also analyzed by electrospray tandem MS on a prototype quadrupole time-of-flight mass spectrometer (MDS Sciex, Toronto, Canada) equipped with a nanoelectrospray ion source (MDS Protana A/S, Denmark). The mixture of peptides obtained by in-gel proteolysis was purified and concentrated prior to nanoelectrospray MS as described. See Shevchenko et al., supra; and Wilm et al. 1996 Anal Chem 68:1-8. For tandem MS experiments, ions were selected using the mass resolving quadrupole. Mass analysis of fragment ions was performed by the TOF analyzer. Peptide sequence tags were constructed from the tandem mass spectra and used for subsequent database searching.

[0216] Genome databases and searching: The A. thaliana genome database was obtained from the curators of the Arabidopsis Genome Initiative at The Institute for Genomic Research (TIGR, Rockville, Md.). A custom Perl script was used to convert the downloaded database into a FASTA formatted sequence index file accepted by the PepSea database search software system (MDS Protana A/S, Denmark). The human genome database (HGdB) was constructed in a similar fashion. Finished and unfinished human genome sequences (phases 0-3) were downloaded from the NCBI ftp site. Peptide sequence tags and MALDI peptide mass maps were searched against the respective databases using the program PepSea (MDS Protana A/S, Denmark). Default search criteria specified trypsin as the protease and required measurement accuracy of better than 50 ppm for both intact peptide ion and fragment ion masses. The amino acid part of the peptide sequence tag was reverse translated into the corresponding degenerate oligonucleotide sequence. Potential hits in the forward or reverse direction on the human genome data were checked as to whether they coded for the amino acid sequence of the tag. The mass distance to the N- and C-terminal part of the potential peptide match was then calculated in the reading frame defined by that match. For the A. thaliana genomic database, searches took 2 s to complete on a PC cluster.
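By way of illustration only, the reverse translation of a tag into a degenerate oligonucleotide pattern and its lookup on both strands might be sketched as follows. This is a minimal sketch, not the PepSea implementation: it assumes the standard genetic code, and the function names `degenerate_pattern`, `reverse_complement` and `find_tag` are hypothetical.

```python
import re

# Standard genetic code with bases ordered T, C, A, G (assumed for this sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {a + b + c: AMINO[16 * i + 4 * j + k]
          for i, a in enumerate(BASES)
          for j, b in enumerate(BASES)
          for k, c in enumerate(BASES)}


def degenerate_pattern(tag):
    """Build a regex that matches every DNA sequence encoding the amino acid tag."""
    parts = []
    for aa in tag:
        codons = sorted(c for c, a in CODONS.items() if a == aa)
        if not codons:
            raise ValueError("not a standard amino acid: " + aa)
        parts.append("(?:" + "|".join(codons) + ")")
    return "".join(parts)


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (upper case)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_tag(tag, dna):
    """Yield (strand, offset, frame) for each position where the tag is encoded.

    Offsets on the '-' strand are relative to the reverse-complemented sequence.
    """
    # A lookahead is used so that overlapping matches are all reported.
    pattern = re.compile("(?=(" + degenerate_pattern(tag) + "))")
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for m in pattern.finditer(seq):
            yield strand, m.start(), m.start() % 3
```

For example, the tag YLEELVK can be looked up in a fragment of SEQ ID No: 91 with `list(find_tag("YLEELVK", "cagaagtacctcgaggagctagtcaaaattggt"))`; each hit fixes a strand and a reading frame in which the flanking masses can then be checked.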

[0217] Gene prediction: Several web-based gene prediction programs were employed for further characterization of identified coding regions of the A. thaliana and human genomes. These included GENSCAN at the Massachusetts Institute of Technology (MIT, Boston, USA), HMMgene at the Center for Biological Sequence Analysis (CBS, The Technical University of Denmark, Lyngby, Denmark) and GRAIL at the Oak Ridge National Laboratory (Oak Ridge, USA).

[0218] B. Results and Discussion

[0219] To assess whether protein derived MS data can be utilized for the identification of proteins in genomes of higher organisms, we analyzed a large number of human and Arabidopsis thaliana (A. thaliana) proteins by nanoelectrospray tandem mass spectrometry and then interrogated genome sequence databases using the mass spectrometric data.

[0220] (i) Identification of Proteins in the A. thaliana Genome

[0221] The sequenced region of the 125-megabase A. thaliana genome covers 115.4 megabases. A combination of algorithms has been used to predict 25,498 putative genes consisting of 132,982 exons with a total length of 33,249,250 bases, corresponding to 29% coding bases (Arabidopsis Genome Initiative 2000 Nature 408:796-815). From a single two dimensional gel (supplementary material) of a total membrane associated protein preparation of A. thaliana, 60 spots were analyzed by a two step mass spectrometric procedure (Shevchenko et al. 1996 PNAS 93:14440-14445) consisting of mass fingerprinting followed by tandem mass spectrometric peptide sequencing using nanoelectrospray (Wilm et al., supra; and Wilm et al. 1996 Nature 379:466-9).

[0222] At the time of the experiment, approximately 75% of the A. thaliana genome was publicly available. Fifty-one of the 60 proteins analyzed were identified in amino acid sequence databases (48 by mass fingerprinting, three by tandem mass spectrometry; see supplementary information for details) and nine were novel proteins. Attempts to identify proteins in the A. thaliana genome by peptide mass fingerprinting alone failed because many genomic regions, when translated in the three forward or three reverse reading frames, give rise to a significant number of randomly matching peptide masses. Instead, tandem mass spectrometric data of a test set of 20 spots, consisting of eleven identified and all nine novel proteins, were assembled into peptide sequence tags (Mann et al. 1994 Anal Chem 66:4390-4399) and used to search the genomic database of A. thaliana. A peptide sequence tag consists of a few amino acids that can easily be assigned (manually or by software) in a tandem mass spectrum. These amino acids are ‘locked’ into position within the peptide by the ‘start’ and ‘end’ masses of a fragment ion series. Together with the mass of the intact peptide, a search template is created. The use of accurate peptide and fragment ion mass information in addition to amino acid sequence information increases the search specificity of a peptide sequence tag by more than a million fold over the short amino acid tag sequence alone. Searches using the peptide sequence tag algorithm (Mann et al., supra) were performed directly on the genomic data without prior translation of the genome sequences into amino acid sequences (the search is performed once each in the forward and reverse directions of the nucleotide sequence) and without regard for predicted coding regions or reading frames. Peptide sequence tags containing as few as two or three amino acids almost always identified a single location in the A. thaliana genome. The specificity of the peptide sequence tags is aided by the high mass accuracy (between 5 and 50 ppm depending on signal strength) provided by quadrupole time of flight instruments even on femtomole amounts of gel separated proteins. See Morris et al. 1996 Rapid Commun Mass Spectrom 10:889-896; and Shevchenko et al. 1997 Rapid Commun Mass Spectrom 11:1015-1024. Because the full retrieved peptide sequence is verified against the fragmentation spectrum, the specificity of a peptide sequence tag search approximates that of the corresponding full peptide sequence. In all cases, proteins could be uniquely identified by a combination of two peptides because they resulted in hits within a confined region of the genome. Four peptide sequence tags were observed to cluster in a 2 kb region of the A. thaliana genome. Whenever a peptide sequence tag unambiguously identifies the corresponding DNA sequence in the genome, this sequence must be part of an exon. The peptide therefore locates the exon and establishes the correct reading frame. In-frame stop codons upstream and downstream of the identified peptide also limit the extent of the exon within which the splice signals (exon intron boundaries) must be found. This information is useful for the reconstruction of the gene from the nucleotide sequence (see below).
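The search template described above, in which a short run of amino acids is ‘locked’ between two masses, can be illustrated with a simplified sketch. It assumes that the two flanking masses are plain sums of monoisotopic residue masses on either side of the tag, which is a simplification of the fragment ion ‘start’ and ‘end’ masses used in practice; the names `residue_sum`, `within_ppm` and `tag_matches` are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values; I and L are isobaric).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residue_sum(seq):
    """Sum of residue masses for a stretch of amino acids (0.0 if empty)."""
    return sum(RESIDUE_MASS[aa] for aa in seq)


def within_ppm(observed, expected, ppm):
    """True if the observed mass lies within `ppm` parts per million of expected."""
    return abs(observed - expected) <= expected * ppm / 1e6


def tag_matches(peptide, start_mass, tag, end_mass, ppm=50.0):
    """True if `tag` occurs in `peptide` flanked by the two given residue-mass sums."""
    for i in range(len(peptide) - len(tag) + 1):
        if peptide[i:i + len(tag)] != tag:
            continue
        left = residue_sum(peptide[:i])
        right = residue_sum(peptide[i + len(tag):])
        if within_ppm(start_mass, left, ppm) and within_ppm(end_mass, right, ppm):
            return True
    return False
```

Because both flanking masses must agree within the stated tolerance (50 ppm here, matching the default search criteria given above), a two or three residue tag rejects most candidate sequences that merely contain the same short amino acid run.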

[0223] Of the eleven known proteins in the test set, seven were unambiguously identified in the A. thaliana genome. Four proteins did not result in a hit in the 75% genomic sequence available, consistent with the result of searching their known sequences in that database, which revealed that they were not yet present. Of the nine novel proteins, we identified five in the A. thaliana genome whereas the remaining four, despite high quality mass spectrometric data, did not result in a match in the database. We therefore concluded that these proteins were not yet present in the genome database (Table 1). During the preparation of this manuscript, the A. thaliana genome was published. See Arabidopsis Genome Initiative, supra. Searching the previously unidentified data in the complete A. thaliana genome yielded the identification of all the novel proteins, indicating a 100% success rate of the method presented here.

[0224] Unambiguous identification of peptides in the genome directly provides the information necessary for further analysis of the corresponding gene. The identified exon sequences define the direction of the nucleotide sequence. The identified exons can be used directly for homology searching and as probes for cloning the genes. Furthermore, the above identifications map the respective genes to their locations in the genome.

[0225] Once part and direction of the coding sequence of a protein had been found in the genome by peptide sequence tags, the mass fingerprinting data obtained in the first step of the mass spectrometric analysis were used to obtain further information about the gene structure. Since only peptide masses are available in peptide mass mapping, the identified genomic region (approximately 10 kb for A. thaliana) was translated in three reading frames. The exon sequence coverage can be refined and additional exons can sometimes be discovered by peptide mass mapping.
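A minimal sketch of this step is given below: a short genomic region on the already established strand is translated in three reading frames, cleaved in silico after K and R, and the resulting peptide masses are compared with observed masses. It assumes the standard genetic code, neutral monoisotopic peptide masses and no missed cleavages or modifications; `translate`, `tryptic_peptides`, `peptide_mass` and `match_masses` are hypothetical names.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {a + b + c: AMINO[16 * i + 4 * j + k]
          for i, a in enumerate(BASES)
          for j, b in enumerate(BASES)
          for k, c in enumerate(BASES)}

# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def translate(dna, frame):
    """Translate one forward reading frame; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODONS.get(dna[i:i + 3], "X")
                   for i in range(frame, len(dna) - 2, 3))


def tryptic_peptides(protein):
    """Cleave after K or R within each stop-free stretch (proline rule ignored)."""
    for stretch in protein.split("*"):
        start = 0
        for i, aa in enumerate(stretch):
            if aa in "KR":
                yield stretch[start:i + 1]
                start = i + 1
        if start < len(stretch):
            yield stretch[start:]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def match_masses(dna, observed, ppm=50.0):
    """Return (frame, peptide, observed_mass) for every mass match in three frames."""
    hits = []
    for frame in range(3):
        for pep in tryptic_peptides(translate(dna, frame)):
            if "X" in pep:
                continue
            mass = peptide_mass(pep)
            hits.extend((frame, pep, obs) for obs in observed
                        if abs(obs - mass) <= mass * ppm / 1e6)
    return hits
```

Restricting the comparison to the roughly 10 kb region already located by sequence tags is what keeps the number of chance mass matches manageable; the same comparison over a whole genome is the case reported above to fail.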

[0226] Peptide sequences can also be used to join adjacent exons. For example, the underlined part of the peptide sequence TFDESKETINKEIEEK (SEQ ID No: 1), derived from MS sequencing of the protein S8, is located in exon 1 and the remainder in exon 2 (Table 1) of the gene. For proteins from A. thaliana, these ‘peptide exon bridges’ were frequently found (on average one per protein), but not frequently enough to always allow the full reconstruction of the gene.

[0227] While gene prediction in genomic DNA of higher organisms is difficult, a combination of mass spectrometric data and computational exon prediction can be very effective at defining gene structure. The genomic region identified through MS data for spot S8 was analyzed with GENSCAN, HMMgene and GRAIL and compared to the known sequence of this protein. The predictions by GENSCAN and HMMgene both missed one exon and predicted a surplus exon, whereas GRAIL predicted several splice sites incorrectly. However, when the peptides identified by MS were used as constraints for coding sequence prediction (using HMMgene), the surplus exon was no longer predicted and the previously missed exon was included (though still with a splice site error). Mass spectrometric data, together with the genome sequence, rectified all incorrect splice sites, led to inclusion of the complete exon that was missed previously and showed that the surplus exon was not present.

[0228] While useful in many cases, the extent to which a predicted gene model can be verified or refined by the MS data obviously depends on the number of exons actually identified by peptide sequence tags. We now perform database searches in real time during the mass spectrometric experiment and combine them with gene prediction, both to increase the number of identified exons by sequencing the appropriate peptides and to directly sequence potential exon spanning peptides.

[0229] (ii) Identification of Proteins in the Human Genome

[0230] The size of the human genome is approximately 25 times that of A. thaliana and it is estimated that only 3% of the nucleotide sequence codes for proteins. To learn about the feasibility of identifying coding sequences in the human genome against the background of the vast amounts of non-coding sequence, we searched data from more than 200 peptides, which we have sequenced by mass spectrometry in various projects, against an estimated 80% of the human genome which was publicly available at the time of writing. The results of these experiments are summarized in Table 2. Peptide sequence tags comprising four amino acid residues retrieve only a single entry if the peptide is indeed in the human genome and none if it is not. With a three amino acid tag, the search retrieves on average two sequences, only one of which fits the spectrum when comparing all calculated fragment ion masses for the retrieved peptides with the experimental spectrum. With a two amino acid tag, on average seven peptide sequences are retrieved. Evaluation of the sequences yields a unique result in almost all cases except when the peptide is too short to be unique in the database (<10 amino acids). Incidentally, we found that tryptic peptides encountered in MS sequencing are typically longer than 10 amino acids and thus were almost always unique in the human genome. As in the case of searches in the A. thaliana genome, however, data from any two peptides were always sufficient for an unambiguous localization of the protein in the genome. This is because it is extremely unlikely that any of the few retrieved sequences, even from short peptides, happen to ‘co-localize’ in the same gene on a chromosome by chance. Still, we recommend the use of at least two peptides per human protein because we find that there is about a 25% chance for a peptide to span an intron exon boundary. Such peptide exon bridges can be detected if another peptide has previously identified the general location of the gene. We are currently working on software that will allow the identification of peptide exon bridges in the complete genome.

[0231] The evaluation, or matching, of the retrieved sequences against the mass spectrum, while unambiguous, was done manually in this project. However, we have investigated the use of an objective criterion to match the spectra by computer. We have found that the ratio of fragments that additionally match the tandem mass spectrum is at least twice as high for the correct sequence as it is for sequences that only matched the formal sequence tag criteria. We also found that fragment ions that continue a tag series were particularly powerful for discrimination because a fragment which continues an ion series effectively converts a tag with n amino acids into one with n+1, leading to fewer matches as shown in Table 2.
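One possible form of such an objective criterion is sketched below, purely as an assumption: the fraction of calculated singly charged b- and y-ions of a candidate sequence that find a peak in the experimental spectrum within a mass tolerance. The exact ratio used in this work is not specified above, and the names `fragment_ions` and `match_ratio` are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Calculated m/z of the singly charged b- and y-ions of a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(masses)):
        ions.append(sum(masses[:i]) + PROTON)            # b ion of length i
        ions.append(sum(masses[i:]) + WATER + PROTON)    # complementary y ion
    return ions


def match_ratio(peptide, peaks, ppm=50.0):
    """Fraction of calculated fragment ions that have a peak within `ppm`."""
    ions = fragment_ions(peptide)
    if not ions:
        return 0.0
    matched = sum(
        any(abs(peak - ion) <= ion * ppm / 1e6 for peak in peaks)
        for ion in ions
    )
    return matched / len(ions)
```

Under this sketch, the candidates retrieved by a tag search would be ranked by `match_ratio`, and a candidate would be accepted when its ratio clearly exceeds that of the other retrieved sequences, in the spirit of the twofold difference reported above.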

[0232] Altogether, we searched experimental data from 49 human proteins from ongoing projects, for which we had tandem mass spectral data on at least two peptides, against the human genome database. Of the 49 proteins, 41 (84%) were unambiguously found in the genome by the methods and criteria explained above. Eight proteins (16%) were unambiguously found not to be contained in the version of the public database that we searched. That version is estimated to contain 80% of the human genome in various states of sequence quality refinement. However, we found all eight of these proteins in non-redundant protein or EST databases by using the mass spectrometric data. We then performed BLAST analysis of these sequences against the human genome, which confirmed that they were not yet present in the genome. Hence, as in the case of searching the A. thaliana genome, the success rate of searching the human genome with MS data was also 100%, which shows that the protein identification method described here scales to very large genomes with a low proportion of coding sequence.

[0233] (iii) Identification of a Novel Human Protein

[0234] Data obtained in our group are now routinely used to identify human proteins in the genome database using the methodology presented in this paper. As an example, we were able to identify a novel human protein starting from a weak silver stained spot from a two dimensional gel. The spot turned out to consist of a mixture of proteins; three known proteins were found in a protein sequence database (data not shown). One additional protein was identified in the human genome database as follows. Four peptide sequence tags were obtained by nanoelectrospray tandem mass spectrometry (Table 3) and queried against the human genome database. Their matches formed a cluster located in two exons and the identification was confirmed by corresponding peptide signals in the peptide mass fingerprint. The gene model proposed by both GENSCAN and HMMgene for the genomic region containing the two exons was compatible with all MS data. Sequence analysis of the identified protein revealed the presence of a signal sequence at the N-terminus of the protein as well as a Jacalin domain (PFM), which is also found in animal prostatic spermine binding proteins.

[0235] The methods presented here can be used in small and large-scale proteomics projects for all organisms that have sequenced genomes as well as their close relatives. Given the availability of minimal mass spectrometric peptide fragmentation data, it is possible to identify any protein from those organisms whether or not additional sequence information in the form of ESTs is available. The approach does not rely on a completely assembled genome sequence, only on full coverage of the genome, which, to date, can be achieved relatively quickly even for large genomes. Furthermore, together with ongoing bioinformatics and comparative genomics efforts, existing EST projects and planned full-length cDNA projects, mass spectrometry combined with genome searches will be a valuable tool for discovering and characterizing the proteins coded by the human and other genomes and will provide direct access to cloning of those molecules.

TABLE 1 Peptide sequence tag identification of proteins from A. thaliana

Protein | Apparent Mw [kDa]/pI | NRDB accession No. | Sequences identified in genome*
1 (S11) | 30/5.5 | tnew AC007019 | FQAAVDILR (SEQ ID No: 2); IKHDIDTETQDIPDAR (SEQ ID No: 3); ITLDPEDPAAVK (SEQ ID No: 4); VFFDIK (SEQ ID No: 5); AQLDELKSDAVEAMESQK (SEQ ID No: 6); SDKKGMDLLVAEFEK (SEQ ID No: 7); KEDLPKYEENLELSMAK (SEQ ID No: 8)
2 (S8) | 44/4.8 | spt Q96262 | YLEELVK (SEQ ID No: 9); VSVFLPEEVK (SEQ ID No: 10); VVETYEATSAEVK (SEQ ID No: 11); EIPVEEVKAEEPAK (SEQ ID No: 12)
3 (S10) | 55/6.6 | spt 39206 | AIAFDEIDKAPEEK (SEQ ID No: 13); FPGDDIPIIR (SEQ ID No: 14); VGEEVEILGLR (SEQ ID No: 15); GSALSALQGTNDEIGR (SEQ ID No: 16); LMDAVDEYIPDPVR (SEQ ID No: 17)
4 (S5) | 60/4.3 | spt 004151 | TLVFQFSVK (SEQ ID No: 18); FYAISAEFPEFSNK (SEQ ID No: 19); FYAISAEFPEFSNKDK (SEQ ID No: 20)
5 (S12) | 65/5.1 | CAB37531 | AVVTVPAYFNDAQR (SEQ ID No: 21); EVDEVLLVGGMTR (SEQ ID No: 22); GVNPDEAVAMGAAIQGGILR (SEQ ID No: 23)
6 (S6a) | 74/5.0 | spt Q39042 | LVPYQIVNK (SEQ ID No: 24); DAGVIAGLNVAR (SEQ ID No: 25); FDLTGVPPAPR (SEQ ID No: 26); FEELNNDLFR (SEQ ID No: 27); IMEYFIK (SEQ ID No: 28)
7 (S44) | 100/6.4 | sptnew AAD25640 | ILLESAIR (SEQ ID No: 29); TSLAPGSGVVTK (SEQ ID No: 30); FYSLPALNDPR (SEQ ID No: 31); VVNFSFDGQPAELK (SEQ ID No: 32); SENAVQANMELEFQR (SEQ ID No: 33)
8 (S183) | 12/10.0 | swiss P34893 (a) | VIAVGPGSR (SEQ ID No: 34); EGDTVLLPEYGGTQVK (SEQ ID No: 35); DEDVLGTLHED (SEQ ID No: 36); TESGILLPEK (SEQ ID No: 37); VIQPAKTESGILLPEK (SEQ ID No: 38); LIPVSVKEGDTVLLPEYGGTQVK (SEQ ID No: 39)
9 (S2) | 55/5.7 | swiss P29685 | TIAMDGTEGLVR (SEQ ID No: 40); VVDLLAPYQR (SEQ ID No: 41); IGLFGGAGVGK (SEQ ID No: 42); VGLTGLTVAEYFR (SEQ ID No: 43)
10 (S4e) | 62/10.2 | spt O23656 (a) | FGLYYVDFK (SEQ ID No: 44); EYADYVFTEYGGK (SEQ ID No: 45); LSIAWSR (SEQ ID No: 46); IGIAHSPAWFEPHDLK (SEQ ID No: 47)
11 (S4b) | 64/8.5 | spt Q42585 (a) | EYADFVFQEYGGK (SEQ ID No: 48); DFLSQGVRPSALK (SEQ ID No: 49); FGLYYVDFK (SEQ ID No: 50); NLNTDAFR (SEQ ID No: 51)
12 (S172) | 25/4.3 | spt Q9LW15 (b) | LDDIDFPEGPFGTK (SEQ ID No: 52); SYYDKR (SEQ ID No: 53)
13 (S152) | 32/8.5 | spt BAB10927 (b) | TLMNVFDK (SEQ ID No: 54); TLMNVFDKTPNVDK (SEQ ID No: 55); VFFSSSAVEYSNLAQAHATENAK (SEQ ID No: 56)
14 (S18) | 36/4.5 | spt Q9LTJ7 (b) | TVVDKSDDAPAETVLK (SEQ ID No: 57); QYDGSDPQKPLLMAIK (SEQ ID No: 58)
15 (S106) | 80/4.8 | spt 19S7E7 (b) | FWDNFGK (SEQ ID No: 59); FGWSANMER (SEQ ID No: 60); YLSVTNPELSK (SEQ ID No: 61); IYEMMDVALSGK (SEQ ID No: 62); EVTTAEYNEFYR (SEQ ID No: 63); AQSTGDTISLDYMK (SEQ ID No: 64)
16 (S104) | 90/5.2 | spt BAB09837 (b) | IFGEDFLNDK (SEQ ID No: 65); SIDSLVITK (SEQ ID No: 66); YFFDGEIQSDKIK (SEQ ID No: 67); VLTEFQEAAK (SEQ ID No: 68); YFFDGEIQSDK (SEQ ID No: 69); IDATEENELAQEYR (SEQ ID No: 70); AEDDVNFYQTVNPDVAK (SEQ ID No: 71)
17 (S93) | 21/7.9 | spt Q9M7T0 (a, b) | LAEGTDITSAAPGVSLQK (SEQ ID No: 72); AVNVEEAPSDFK (SEQ ID No: 73); WSAYVEDGKVK (SEQ ID No: 74); SKLAEGTDITSAAPGVSLQK (SEQ ID No: 75)
18 (S36) | 34/8.7 | spt AAG12816 (a, b) | NYINLAQIHASENSK (SEQ ID No: 76); SFEQIEVER (SEQ ID No: 77); INLAQIHASENSK (SEQ ID No: 78); AIYTVGNWIR (SEQ ID No: 79)
19 (S7) | 56/7.3 | spt Q9SGA7 (a, b) | IPTAELFAR (SEQ ID No: 80); RIPTAELFAR (SEQ ID No: 81); ALEEEIEDIGGHLNAYTSR (SEQ ID No: 82); TILGPAQNVK (SEQ ID No: 83); LSSDPTTSQLVANEPASFTGSEVR (SEQ ID No: 84)
20 (S1) | 100/5.9 | spt Q9SSG3 (a, b) | SASITGGYFYR (SEQ ID No: 85)

*Peptide sequences are the result of searches by peptide sequence tags in the A. thaliana genome. For all hits in the genome data, further peptides were identified using the MALDI MS peptide mass map (data not shown).
(a), (b) Proteins that did not result in a hit in the 75% genomic sequence or the non-redundant protein database, respectively, available at the time of the experiment.

[0236] TABLE 2 Statistics of human genome searches with Peptide Sequence Tags

No of AA in tag | No of samples | Hits in genome (average) | Range | Unique identification after data verification
2 | 15 | 7.3 | 1-22 | yes for all
3 | 45 | 2.1 | 1-10 | yes for all
4 | 21 | 1.1 | 1-2 | yes for all
5 | 2 | 1.0 | 1-1 | yes for all

[0237] Statistics on a number of searches with peptide sequence tags of various lengths. Peptides were at least 10 amino acids long.

TABLE 3 Mass spectrometric identification of peptides of a new human protein in the human genome database

Peptide sequence tags were constructed from nanoelectrospray tandem mass spectra and searched against the human genome database; the retrieved peptide sequences are listed.

Peptide sequence tag | Sequence identified in human genome
(642.49)VS(99.05) | VSVGLLLVK (SEQ ID No: 86)
(634.38)FAV(246.15) | VFVAFQAFLR (SEQ ID No: 87)
(1347.68)TTSF(163.08) (SEQ ID No: 88) | YFSTTEDYDHEITGLR (SEQ ID No: 89)
(807.45)QLT(1093.57) | LGALGGNTQEVTLQPGEYITK (SEQ ID No: 90)

Example 2

[0238] Human proteins obtained from 1D or 2D PAGE gels were digested in gel and the resulting peptide mixtures analyzed by MALDI peptide mass mapping (Bruker Reflex III). Peptides were also sequenced by nanoelectrospray on a quadrupole TOF MS (QSTAR, PE Sciex). Peptide sequence tags and peptide mass maps were searched at 50 ppm mass accuracy against publicly available sequences of the human genome (ca. 80% coverage, NCBI) using the program ID program (MDS Protana). Using a PC cluster consisting of 12 members, searches in the human genome required 75 s CPU time. Further analysis of identified coding regions was performed using the gene prediction programs GRAIL, HMMgene and GENSCAN.

[0239] Peptide sequences correspond to coding regions within a gene. Whenever a peptide sequence tag derived from an MS/MS spectrum unambiguously identifies the corresponding DNA sequence in the genome, this sequence must be part of an exon. The peptide therefore locates the exon as well as the correct reading frame. In-frame stop codons upstream and downstream of the identified peptide also limit the extent of the exon within which the splice signals (exon intron boundaries) must be found. Mass spectral data can be used to screen the vicinity of mapped regions for further exons. In many cases, peptides span two exons, which enables the localization of the exact splice site for the two exons involved.
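The limit that in-frame stop codons place on the extent of an exon can be illustrated with a short sketch. It assumes that the peptide hit is given as a half-open index range on the coding strand; the function name `orf_bounds` is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def orf_bounds(dna, hit_start, hit_end):
    """Return (lo, hi), the stop-free in-frame window around dna[hit_start:hit_end].

    `lo` is the first base after the nearest upstream in-frame stop codon (or the
    earliest in-frame position), and `hi` is the first base of the nearest
    downstream in-frame stop codon (or the last in-frame position).
    """
    dna = dna.upper()
    lo = hit_start
    while lo - 3 >= 0 and dna[lo - 3:lo] not in STOPS:
        lo -= 3
    hi = hit_end
    while hi + 3 <= len(dna) and dna[hi:hi + 3] not in STOPS:
        hi += 3
    return lo, hi
```

The splice signals for the exon containing the peptide must then lie within `dna[lo:hi]`, which narrows the region that a gene prediction program or a peptide mass map has to explain.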

[0240] Typically, several peptides are partially sequenced during the course of a protein identification experiment using nanoES tandem MS. Subsequent database searches identify peptides which cluster in a confined (2-15 kb) region of the genome which encompasses the underlying gene. The identified peptides define reading frames which in turn hold information about the intron/exon structure of the gene. Generally, two peptides are sufficient to identify and map the respective gene to its chromosomal location. Any of the identified exons can be used as probes for cloning or for homology searching for tentative function assignment. The defined genome area can be used to direct sequencing of further peptides in the same experiment.
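The co-localization test described above can be sketched as a simple grouping of hits by genomic position. This is an illustrative sketch only: hits are assumed to be (contig, position, peptide) tuples, the 15 kb default window is taken from the 2-15 kb range stated above, and `cluster_hits` and `candidate_genes` are hypothetical names.

```python
def cluster_hits(hits, window=15000):
    """Group (contig, position, peptide) hits whose span stays within `window` bp."""
    clusters = []
    for contig, pos, pep in sorted(hits):
        last = clusters[-1] if clusters else None
        # Extend the current cluster only if the hit is on the same contig and
        # no further than `window` from the first hit of that cluster.
        if last and last[0][0] == contig and pos - last[0][1] <= window:
            last.append((contig, pos, pep))
        else:
            clusters.append([(contig, pos, pep)])
    return clusters


def candidate_genes(hits, window=15000, min_peptides=2):
    """Keep clusters supported by at least `min_peptides` distinct peptides."""
    return [cluster for cluster in cluster_hits(hits, window)
            if len({pep for _, _, pep in cluster}) >= min_peptides]
```

With a short tag that retrieves several sequences, only the cluster in which a second, independent peptide also lands survives the filter, which reflects the reasoning that chance co-localization of unrelated hits is extremely unlikely.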

[0241] Most strategies for large scale protein identification follow a two tier analytical approach in which first a MALDI peptide mass fingerprint is created and samples that are not identified in this round of analysis are subjected to partial sequencing by tandem MS. It should be stressed that, owing to the complex structure of genes in higher organisms, fingerprint data alone do not hold sufficient discriminating power to identify proteins directly in a genome. However, once part of the coding sequence of a protein has been found in the genome by peptide sequence tags, the 2-15 kb genomic sequence can be searched with the fingerprinting data by translating the nucleotide sequence in the three respective reading frames. Thereby, exon sequence coverage can be extended and additional exons can sometimes be found.

[0242] Computational gene prediction in genomic DNA of higher organisms is very difficult, but a combination of MS data and exon prediction can be very effective at defining gene structure. The genomic region identified in the previous figure was analyzed with GENSCAN and GRAIL and compared to the known sequence of this protein. GENSCAN missed one exon and predicted a surplus one, whereas GRAIL predicted two splice sites incorrectly. MS data, in conjunction with the genome sequence, rectified the incorrect splice sites, led to inclusion of the exon that was missed and showed that the surplus exon was not present. The extent to which a predicted gene model can be verified or refined by the MS data obviously depends on the number of exons actually identified by peptide sequence tags.

[0243] The size of the human genome is approximately 25 times that of A. thaliana but the coding sequence is expected to be only 2-3 times larger. Tryptic peptides of the size typically encountered in MS sequencing (>10 aa) are almost always unique in the human genome. The information content of peptide sequence tags approximates that of the complete peptide sequence. In addition, the sequences retrieved by the search are checked against the tandem MS data, which eliminates false positives. Therefore, searches using even short tags almost always result in unique identifications. Interestingly, the search specificity in the human genome is virtually identical to that of dbEST, but with the added advantages of high sequence accuracy, low redundancy and unbiased coverage.

1 112 1 16 PRT Arabidopsis thaliana 1 Thr Phe Asp Glu Ser Lys Glu ThrIle Asn Lys Glu Ile Glu Glu Lys 1 5 10 15 2 9 PRT Arabidopsis thaliana 2Phe Gln Ala Ala Val Asp Ile Leu Arg 1 5 3 16 PRT Arabidopsis thaliana 3Ile Lys His Asp Ile Asp Thr Glu Thr Gln Asp Ile Pro Asp Ala Arg 1 5 1015 4 12 PRT Arabidopsis thaliana 4 Ile Thr Leu Asp Pro Glu Asp Pro AlaAla Val Lys 1 5 10 5 6 PRT Arabidopsis thaliana 5 Val Phe Phe Asp IleLys 1 5 6 18 PRT Arabidopsis thaliana 6 Ala Gln Leu Asp Glu Leu Lys SerAsp Ala Val Glu Ala Met Glu Ser 1 5 10 15 Gln Lys 7 15 PRT Arabidopsisthaliana 7 Ser Asp Lys Lys Gly Met Asp Leu Leu Val Ala Glu Phe Glu Lys 15 10 15 8 17 PRT Arabidopsis thaliana 8 Lys Glu Asp Leu Pro Lys Tyr GluGlu Asn Leu Glu Leu Ser Met Ala 1 5 10 15 Lys 9 7 PRT Arabidopsisthaliana 9 Tyr Leu Glu Glu Leu Val Lys 1 5 10 10 PRT Arabidopsisthaliana 10 Val Ser Val Phe Leu Pro Glu Glu Val Lys 1 5 10 11 13 PRTArabidopsis thaliana 11 Val Val Glu Thr Tyr Glu Ala Thr Ser Ala Glu ValLys 1 5 10 12 14 PRT Arabidopsis thaliana 12 Glu Ile Pro Val Glu Glu ValLys Ala Glu Glu Pro Ala Lys 1 5 10 13 14 PRT Arabidopsis thaliana 13 AlaIle Ala Phe Asp Glu Ile Asp Lys Ala Pro Glu Glu Lys 1 5 10 14 10 PRTArabidopsis thaliana 14 Phe Pro Gly Asp Asp Ile Pro Ile Ile Arg 1 5 1015 11 PRT Arabidopsis thaliana 15 Val Gly Glu Glu Val Glu Ile Leu GlyLeu Arg 1 5 10 16 16 PRT Arabidopsis thaliana 16 Gly Ser Ala Leu Ser AlaLeu Gln Gly Thr Asn Asp Glu Ile Gly Arg 1 5 10 15 17 14 PRT Arabidopsisthaliana 17 Leu Met Asp Ala Val Asp Glu Tyr Ile Pro Asp Pro Val Arg 1 510 18 9 PRT Arabidopsis thaliana 18 Thr Leu Val Phe Gln Phe Ser Val Lys1 5 19 14 PRT Arabidopsis thaliana 19 Phe Tyr Ala Ile Ser Ala Glu PhePro Glu Phe Ser Asn Lys 1 5 10 20 16 PRT Arabidopsis thaliana 20 Phe TyrAla Ile Ser Ala Glu Phe Pro Glu Phe Ser Asn Lys Asp Lys 1 5 10 15 21 14PRT Arabidopsis thaliana 21 Ala Val Val Thr Val Pro Ala Tyr Phe Asn AspAla Gln Arg 1 5 10 22 13 PRT Arabidopsis thaliana 22 Glu Val Asp Glu ValLeu Leu Val Gly Gly Met Thr Arg 1 5 
10 23 20 PRT Arabidopsis thaliana 23Gly Val Asn Pro Asp Glu Ala Val Ala Met Gly Ala Ala Ile Gln Gly 1 5 1015 Gly Ile Leu Arg 20 24 9 PRT Arabidopsis thaliana 24 Leu Val Pro TyrGln Ile Val Asn Lys 1 5 25 12 PRT Arabidopsis thaliana 25 Asp Ala GlyVal Ile Ala Gly Leu Asn Val Ala Arg 1 5 10 26 11 PRT Arabidopsisthaliana 26 Phe Asp Leu Thr Gly Val Pro Pro Ala Pro Arg 1 5 10 27 10 PRTArabidopsis thaliana 27 Phe Glu Glu Leu Asn Asn Asp Leu Phe Arg 1 5 1028 7 PRT Arabidopsis thaliana 28 Ile Met Glu Tyr Phe Ile Lys 1 5 29 8PRT Arabidopsis thaliana 29 Ile Leu Leu Glu Ser Ala Ile Arg 1 5 30 12PRT Arabidopsis thaliana 30 Thr Ser Leu Ala Pro Gly Ser Gly Val Val ThrLys 1 5 10 31 11 PRT Arabidopsis thaliana 31 Phe Tyr Ser Leu Pro Ala LeuAsn Asp Pro Arg 1 5 10 32 14 PRT Arabidopsis thaliana 32 Val Val Asn PheSer Phe Asp Gly Gln Pro Ala Glu Leu Lys 1 5 10 33 15 PRT Arabidopsisthaliana 33 Ser Glu Asn Ala Val Gln Ala Asn Met Glu Leu Glu Phe Gln Arg1 5 10 15 34 9 PRT Arabidopsis thaliana 34 Val Ile Ala Val Gly Pro GlySer Arg 1 5 35 16 PRT Arabidopsis thaliana 35 Glu Gly Asp Thr Val LeuLeu Pro Glu Tyr Gly Gly Thr Gln Val Lys 1 5 10 15 36 11 PRT Arabidopsisthaliana 36 Asp Glu Asp Val Leu Gly Thr Leu His Glu Asp 1 5 10 37 10 PRTArabidopsis thaliana 37 Thr Glu Ser Gly Ile Leu Leu Pro Glu Lys 1 5 1038 16 PRT Arabidopsis thaliana 38 Val Ile Gln Pro Ala Lys Thr Glu SerGly Ile Leu Leu Pro Glu Lys 1 5 10 15 39 23 PRT Arabidopsis thaliana 39Leu Ile Pro Val Ser Val Lys Glu Gly Asp Thr Val Leu Leu Pro Glu 1 5 1015 Tyr Gly Gly Thr Gln Val Lys 20 40 12 PRT Arabidopsis thaliana 40 ThrIle Ala Met Asp Gly Thr Glu Gly Leu Val Arg 1 5 10 41 10 PRT Arabidopsisthaliana 41 Val Val Asp Leu Leu Ala Pro Tyr Gln Arg 1 5 10 42 11 PRTArabidopsis thaliana 42 Ile Gly Leu Phe Gly Gly Ala Gly Val Gly Lys 1 510 43 13 PRT Arabidopsis thaliana 43 Val Gly Leu Thr Gly Leu Thr Val AlaGlu Tyr Phe Arg 1 5 10 44 9 PRT Arabidopsis thaliana 44 Phe Gly Leu TyrTyr Val Asp Phe Lys 1 5 45 13 PRT Arabidopsis thaliana 45 Glu Tyr 
AlaAsp Tyr Val Phe Thr Glu Tyr Gly Gly Lys 1 5 10 46 7 PRT Arabidopsisthaliana 46 Leu Ser Ile Ala Trp Ser Arg 1 5 47 16 PRT Arabidopsisthaliana 47 Ile Gly Ile Ala His Ser Pro Ala Trp Phe Glu Pro His Asp LeuLys 1 5 10 15 48 13 PRT Arabidopsis thaliana 48 Glu Tyr Ala Asp Phe ValPhe Gln Glu Tyr Gly Gly Lys 1 5 10 49 13 PRT Arabidopsis thaliana 49 AspPhe Leu Ser Gln Gly Val Arg Pro Ser Ala Leu Lys 1 5 10 50 9 PRTArabidopsis thaliana 50 Phe Gly Leu Tyr Tyr Val Asp Phe Lys 1 5 51 8 PRTArabidopsis thaliana 51 Asn Leu Asn Thr Asp Ala Phe Arg 1 5 52 14 PRTArabidopsis thaliana 52 Leu Asp Asp Ile Asp Phe Pro Glu Gly Pro Phe GlyThr Lys 1 5 10 53 6 PRT Arabidopsis thaliana 53 Ser Tyr Tyr Asp Lys Arg1 5 54 8 PRT Arabidopsis thaliana 54 Thr Leu Met Asn Val Phe Asp Lys 1 555 14 PRT Arabidopsis thaliana 55 Thr Leu Met Asn Val Phe Asp Lys ThrPro Asn Val Asp Lys 1 5 10 56 23 PRT Arabidopsis thaliana 56 Val Phe PheSer Ser Ser Ala Val Glu Tyr Ser Asn Leu Ala Gln Ala 1 5 10 15 His AlaThr Glu Asn Ala Lys 20 57 16 PRT Arabidopsis thaliana 57 Thr Val Val AspLys Ser Asp Asp Ala Pro Ala Glu Thr Val Leu Lys 1 5 10 15 58 16 PRTArabidopsis thaliana 58 Gln Tyr Asp Gly Ser Asp Pro Gln Lys Pro Leu LeuMet Ala Ile Lys 1 5 10 15 59 7 PRT Arabidopsis thaliana 59 Phe Trp AspAsn Phe Gly Lys 1 5 60 9 PRT Arabidopsis thaliana 60 Phe Gly Trp Ser AlaAsn Met Glu Arg 1 5 61 11 PRT Arabidopsis thaliana 61 Tyr Leu Ser ValThr Asn Pro Glu Leu Ser Lys 1 5 10 62 12 PRT Arabidopsis thaliana 62 IleTyr Glu Met Met Asp Val Ala Leu Ser Gly Lys 1 5 10 63 12 PRT Arabidopsisthaliana 63 Glu Val Thr Thr Ala Glu Tyr Asn Glu Phe Tyr Arg 1 5 10 64 14PRT Arabidopsis thaliana 64 Ala Gln Ser Thr Gly Asp Thr Ile Ser Leu AspTyr Met Lys 1 5 10 65 10 PRT Arabidopsis thaliana 65 Ile Phe Gly Glu AspPhe Leu Asn Asp Lys 1 5 10 66 9 PRT Arabidopsis thaliana 66 Ser Ile AspSer Leu Val Ile Thr Lys 1 5 67 13 PRT Arabidopsis thaliana 67 Tyr PhePhe Asp Gly Glu Ile Gln Ser Asp Lys Ile Lys 1 5 10 68 10 PRT Arabidopsisthaliana 68 Val Leu Thr Glu 
Phe Gln Glu Ala Ala Lys 1 5 10 69 11 PRTArabidopsis thaliana 69 Tyr Phe Phe Asp Gly Glu Ile Gln Ser Asp Lys 1 510 70 14 PRT Arabidopsis thaliana 70 Ile Asp Ala Thr Glu Glu Asn Glu LeuAla Gln Glu Tyr Arg 1 5 10 71 17 PRT Arabidopsis thaliana 71 Ala Glu AspAsp Val Asn Phe Tyr Gln Thr Val Asn Pro Asp Val Ala 1 5 10 15 Lys 72 18PRT Arabidopsis thaliana 72 Leu Ala Glu Gly Thr Asp Ile Thr Ser Ala AlaPro Gly Val Ser Leu 1 5 10 15 Gln Lys 73 12 PRT Arabidopsis thaliana 73Ala Val Asn Val Glu Glu Ala Pro Ser Asp Phe Lys 1 5 10 74 11 PRTArabidopsis thaliana 74 Trp Ser Ala Tyr Val Glu Asp Gly Lys Val Lys 1 510 75 20 PRT Arabidopsis thaliana 75 Ser Lys Leu Ala Glu Gly Thr Asp IleThr Ser Ala Ala Pro Gly Val 1 5 10 15 Ser Leu Gln Lys 20 76 15 PRTArabidopsis thaliana 76 Asn Tyr Ile Asn Leu Ala Gln Ile His Ala Ser GluAsn Ser Lys 1 5 10 15 77 9 PRT Arabidopsis thaliana 77 Ser Phe Glu GlnIle Glu Val Glu Arg 1 5 78 13 PRT Arabidopsis thaliana 78 Ile Asn LeuAla Gln Ile His Ala Ser Glu Asn Ser Lys 1 5 10 79 10 PRT Arabidopsisthaliana 79 Ala Ile Tyr Thr Val Gly Asn Trp Ile Arg 1 5 10 80 9 PRTArabidopsis thaliana 80 Ile Pro Thr Ala Glu Leu Phe Ala Arg 1 5 81 10PRT Arabidopsis thaliana 81 Arg Ile Pro Thr Ala Glu Leu Phe Ala Arg 1 510 82 19 PRT Arabidopsis thaliana 82 Ala Leu Glu Glu Glu Ile Glu Asp IleGly Gly His Leu Asn Ala Tyr 1 5 10 15 Thr Ser Arg 83 10 PRT Arabidopsisthaliana 83 Thr Ile Leu Gly Pro Ala Gln Asn Val Lys 1 5 10 84 25 PRTArabidopsis thaliana 84 Leu Ser Ser Asp Pro Thr Thr Thr Ser Gln Leu ValAla Asn Glu Pro 1 5 10 15 Ala Ser Phe Thr Gly Ser Glu Val Arg 20 25 8511 PRT Arabidopsis thaliana 85 Ser Ala Ser Ile Thr Gly Gly Tyr Phe TyrArg 1 5 10 86 9 PRT Homo sapiens 86 Val Ser Val Gly Leu Leu Leu Val Lys1 5 87 10 PRT Homo sapiens 87 Val Phe Val Ala Phe Gln Ala Phe Leu Arg 15 10 88 4 PRT Artificial Sequence peptide sequence tag 88 Thr Thr SerPhe 1 89 16 PRT Homo sapiens 89 Tyr Phe Ser Thr Thr Glu Asp Tyr Asp HisGlu Ile Thr Gly Leu Arg 1 5 10 15 90 21 PRT Homo sapiens 90 Leu Gly 
AlaLeu Gly Gly Asn Thr Gln Glu Val Thr Leu Gln Pro Gly 1 5 10 15 Glu TyrIle Thr Lys 20 91 1350 DNA Arabidopsis thaliana 91 tacgtataat aggaaacaatcaacaaggaa attgaggaga aaaagacaga actccaacca 60 aaggtcgtgg aaacctatgaagccacgtct gcagaagtca aggtcgctat catcttcttt 120 tttgttctcc aagttacattaagtaattca aatccatatt tactgtaatt aagactaaaa 180 ttaaaacgag caggctttggtgagagaccc taaggtggct ggtttgaaga aaaactcagc 240 ggctgtgcag aagtacctcgaggagctagt caaaattggt aaagattcaa ttacgctttg 300 tcatgtgtct attttaaaatatcccgacaa caatatatgc ttgtcttata tatttatcgt 360 tttttgtctt atatatcaaatatttgctgg caaaattttt ggattcttat catgtcctat 420 attcttatac gcatgttatatctctagttt ataagtcatg catatatctt tcatatagta 480 taatcaaagt tatttagcttgtatattgga gtatttatta tttttctatg aacatattat 540 atcattctca ttcaactagacgtaagtttc attcgtcata ttcactcact ttttgtttat 600 ggtctataat agtggttctctatggcttat ttatttatat gatcttgcct aagaaaattg 660 gcacattcaa aaacctcaaactattgtgga tcaaacttca tcatcttttt cagaagacgt 720 ctagagattg tattggatcactgtttcttt tgcatgattt ataattagat gaacgttgta 780 ctcatctatg tctcaaattatgaaatgttt attcagaatt ccccggatca aaagcggtga 840 gtgaggcttc gtctagcttcggagctggct atgtcgcagg accggtcacg ttcatattcg 900 agaaggtatc tgttttcctcccggaggagg tgaagactaa agaaataccg gtggaggaag 960 tgaaagctga agaacctgccaaaactgaag aaccagccaa aaccgaagga acaagtggtg 1020 agaaagagga gattgttgaagagacaaaga aaggcgagac ccctgaaacc gcggtcgtgg 1080 aggagaagaa accagaggtagaggagaaga aggaagaggc tactccggct ccggcagtgg 1140 ttgaaactcc agttaaggaaccggagacaa cgacgacagc gccagtggct gaaccaccaa 1200 agccttgatt tgttttcaagatggtatgat tacttctctt gtatgaaaac atctttgtac 1260 gtaacaaaaa aatgaaaggaagaaacggaa aggaacaaaa aaaaaaaact tatttttcat 1320 tctttttttt caatttggtttgtctttgtg 1350 92 9 PRT Arabidopsis thaliana 92 Lys Tyr Leu Glu Glu LeuVal Lys Ile 1 5 93 15 PRT Arabidopsis thaliana 93 Lys Lys Gly Glu ThrPro Glu Thr Ala Val Val Glu Glu Lys Lys 1 5 10 15 94 15 PRT Arabidopsisthaliana 94 Lys Val Val Glu Thr Tyr Glu Ala Thr Ser Ala Glu Val Lys Val1 5 10 15 95 16 PRT Arabidopsis thaliana 95 Lys 
Glu Ile Pro Val Glu GluVal Lys Ala Glu Glu Pro Ala Lys Thr 1 5 10 15 96 18 PRT Arabidopsisthaliana 96 Lys Thr Glu Gly Thr Ser Gly Glu Lys Glu Glu Ile Val Glu GluThr 1 5 10 15 Lys Lys 97 22 PRT Arabidopsis thaliana 97 Lys Lys Gly GluThr Pro Glu Thr Ala Val Val Glu Glu Lys Lys Pro 1 5 10 15 Glu Val GluGlu Lys Lys 20 98 26 PRT Arabidopsis thaliana 98 Lys Ala Val Ser Glu AlaSer Ser Ser Phe Gly Ala Gly Tyr Val Ala 1 5 10 15 Gly Pro Val Thr PheIle Phe Glu Lys Val 20 25 99 30 PRT Arabidopsis thaliana 99 Lys Ala GluGlu Pro Ala Lys Thr Glu Glu Pro Ala Lys Thr Glu Gly 1 5 10 15 Thr SerGly Glu Lys Glu Glu Ile Val Glu Glu Thr Lys Lys 20 25 30 100 12 PRTArabidopsis thaliana 100 Lys Met Lys Gly Arg Asn Gly Lys Glu Gln Lys Lys1 5 10 101 238 PRT Artificial Sequence Genscan analysis 101 Met Gly TyrTrp Asn Ser Lys Val Val Pro Lys Phe Lys Lys Leu Phe 1 5 10 15 Glu LysAsn Ser Ala Lys Lys Ala Ala Ala Ala Glu Ala Thr Lys Thr 20 25 30 Phe AspGlu Ser Lys Glu Thr Ile Asn Lys Glu Ile Glu Glu Lys Lys 35 40 45 Thr GluLeu Gln Pro Lys Val Val Glu Thr Tyr Glu Ala Thr Ser Ala 50 55 60 Glu ValLys Ala Leu Val Arg Asp Pro Lys Val Ala Gly Leu Lys Lys 65 70 75 80 AsnSer Ala Ala Val Gln Lys Tyr Leu Glu Glu Leu Val Lys Ile Glu 85 90 95 ValAla Ile Ile Phe Phe Phe Val Leu Gln Val Thr Leu Lys Phe Pro 100 105 110Gly Ser Lys Ala Val Ser Glu Ala Ser Ser Ser Gly Ala Gly Tyr Val 115 120125 Ala Gly Pro Val Thr Phe Ile Phe Glu Lys Val Ser Val Phe Leu Pro 130135 140 Glu Glu Val Lys Thr Lys Glu Ile Pro Val Glu Glu Val Lys Ala Glu145 150 155 160 Glu Pro Ala Lys Thr Glu Glu Pro Ala Lys Thr Glu Gly ThrSer Gly 165 170 175 Glu Lys Glu Glu Ile Val Glu Glu Thr Lys Lys Gly GluThr Pro Glu 180 185 190 Thr Ala Val Val Glu Glu Lys Lys Pro Glu Val GluGlu Lys Lys Glu 195 200 205 Glu Ala Thr Pro Ala Pro Ala Val Val Glu ThrPro Val Lys Glu Pro 210 215 220 Glu Thr Thr Thr Thr Ala Pro Val Ala GluPro Pro Lys Pro 225 230 235 102 225 PRT Arabidopsis thaliana 102 Met GlyTyr Trp Asn Ser Lys Val Val Pro Lys Phe Lys Lys Leu Phe 1 5 10 15 
GluLys Asn Ser Ala Lys Lys Ala Ala Ala Ala Glu Ala Thr Lys Thr 20 25 30 PheAsp Glu Ser Lys Glu Thr Ile Asn Lys Glu Ile Glu Glu Lys Lys 35 40 45 ThrGlu Leu Gln Pro Lys Val Val Glu Thr Tyr Glu Ala Thr Ser Ala 50 55 60 GluVal Lys Ala Leu Val Arg Asp Pro Lys Val Ala Gly Leu Lys Lys 65 70 75 80Asn Ser Ala Ala Val Gln Lys Tyr Leu Glu Glu Leu Val Lys Ile Glu 85 90 95Phe Pro Gly Ser Lys Ala Val Ser Glu Ala Ser Ser Ser Phe Gly Ala 100 105110 Gly Tyr Val Ala Gly Pro Val Thr Phe Ile Phe Glu Lys Val Ser Val 115120 125 Phe Leu Pro Glu Glu Val Lys Thr Lys Glu Ile Pro Val Glu Glu Val130 135 140 Lys Ala Glu Glu Pro Ala Lys Thr Glu Glu Pro Ala Lys Thr GluGly 145 150 155 160 Thr Ser Gly Glu Lys Glu Glu Ile Val Glu Glu Thr LysLys Gly Glu 165 170 175 Thr Pro Glu Thr Ala Val Val Glu Glu Lys Lys ProGlu Val Glu Glu 180 185 190 Lys Lys Glu Glu Ala Thr Pro Ala Pro Ala ValVal Glu Thr Pro Val 195 200 205 Lys Glu Pro Glu Thr Thr Thr Thr Ala ProVal Ala Glu Pro Pro Lys 210 215 220 Pro 225 103 11 PRT Homo sapiens 103Lys Phe Leu Asp Gln Glu Glu Ala Glu Phe Leu 1 5 10 104 12 PRT Homosapiens 104 Arg Thr Val Val Met Ser Ser Glu Ala Glu Ile Phe 1 5 10 1051500 DNA Homo sapiens 105 cctccaggag agccagggac tcacccggcc cttgtcccagactaactctg gtcacagaac 60 catcctgtct gcctggaggg gtggggtccc ctgttctggcagaggtcacc cccatatcac 120 cgcatgggga ttttcttccc tttgggtctc tcttttcttcagagatgtat ggccctggag 180 gaggcaagta tttcagcacc actgaagact acgaccatgaaatcacaggg ctgcgggtgt 240 ctgtaggtct tctcctggtg aaaaggtgag tagggctatggtcatgggcc cagcgccatg 300 tcccctccca tcccacagtt tcaggaactc agggcagggggtaagcaccc gtggccactt 360 ttgccacaca tgcctggcta ctgtcgatgc ttcctggctcccgctgatgc ttcctggctg 420 gagcggagac ggtcagaccg tcctccctac cttctcccttcaacccaagc tcaactcaac 480 caaaaatggc ccctctgtcc ccatgcctga taggaaagtcaggggaaagt ctgtccgatt 540 actgtcaaag aagacaggag gtaagggtca gagtggaccactgactgaat atgagtcgca 600 gaagtgttag aggcagaagt ccagggccat ttccttaatatcgaagtgtc tctgctggag 660 gtctgggatg gatttttgcc ctgcatttag aagttctggggtcctgggag aggggagaga 720 
agcccaatag cagaggagac agagtgtggg cggggcgagccggaggggtg catcctggga 780 gagcaccagg gtgagggagg ggtgaagatg agccccgtcagggaagcgct ggcgagtgtg 840 ggaagtcacc tgcccctcgg cctgtgagct gctctgcttggagtgactaa ggctcgggag 900 gtccaggctc ggccagaggc agctcatatg tgggccacagtgacggcagc tggtgccttc 960 tgggtcacgg agacctggcg ctgcacgcag ctctcctcaccaggatctca gtgactcctc 1020 ccaaaagtca cacccacttt gcagacgggg aaactgagtccggagaggct gggtaacgag 1080 ctcaagatca cagggcccaa aagtggtaga atcagggttggtgaccagtg agtctgtgtc 1140 agggacccaa agtctgatgg tgctggactc tctgcatcccgggaaggagg atgggggcgc 1200 tgaggacccg ggatgtgctg ggccatccca gatctggacgtccaaagctt tgcctctctc 1260 ccagtgtcca ggtgaaactt ggagactcct gggacgtgaaactgggagcc ttaggtggga 1320 atacccagga agtcaccctg cagccaggcg aatacatcacaaaagtcttt gtcgccttcc 1380 aagctttcct ccggggtatg gtcatgtaca ccagcaaggaccgctatttc tattttggga 1440 agcttgatgg ccagatctcc tctgcctacc ccagccaagaggggcaggtg ctggtgggca 1500 106 8 PRT Homo sapiens 106 Gly Met Val MetTyr Thr Ser Lys 1 5 107 9 PRT Homo sapiens 107 Val Ser Val Gly Leu LeuLeu Val Lys 1 5 108 8 PRT Homo sapiens 108 Asp Arg Tyr Phe Tyr Phe GlyLys 1 5 109 10 PRT Homo sapiens 109 Val Phe Val Ala Phe Gln Ala Phe LeuArg 1 5 10 110 16 PRT Homo sapiens 110 Tyr Phe Ser Thr Thr Glu Asp TyrAsp His Glu Ile Thr Gly Leu Arg 1 5 10 15 111 21 PRT Homo sapiens 111Leu Gly Ala Leu Gly Gly Asn Thr Gln Glu Val Thr Leu Gln Pro Gly 1 5 1015 Glu Tyr Ile Thr Lys 20 112 172 PRT Homo sapiens 112 Met Leu Leu LeuLeu Thr Leu Ala Leu Leu Gly Gly Pro Thr Trp Ala 1 5 10 15 Gly Lys MetTyr Gly Pro Gly Gly Gly Lys Tyr Phe Ser Thr Thr Glu 20 25 30 Asp Tyr AspHis Glu Ile Thr Gly Leu Arg Val Ser Val Gly Leu Leu 35 40 45 Leu Val LysSer Val Gln Val Lys Leu Gly Asp Ser Trp Asp Val Lys 50 55 60 Leu Gly AlaLeu Gly Gly Asn Thr Gln Glu Val Thr Leu Gln Pro Gly 65 70 75 80 Glu TyrIle Thr Lys Val Phe Val Ala Phe Gln Ala Phe Leu Arg Gly 85 90 95 Met ValMet Tyr Thr Ser Lys Asp Arg Tyr Phe Tyr Phe Gly Lys Leu 100 105 110 AspGly Gln Ile Ser Ser Ala Tyr Pro Ser Gln Glu Gly Gln Val Leu 
115 120 125Val Gly Ile Tyr Gly Gln Tyr Gln Leu Leu Gly Ile Lys Ser Ile Gly 130 135140 Phe Glu Trp Asn Tyr Pro Leu Glu Glu Pro Thr Thr Glu Pro Pro Val 145150 155 160 Asn Leu Thr Tyr Ser Ala Asn Ser Pro Val Gly Arg 165 170

We claim:
 1. A method for identifying a coding sequence in a genomic database, comprising: (i) generating, for an input polypeptide sequence, a set of sequence tags corresponding to possible coding sequences for the input polypeptide sequence; and (ii) identifying, by an approximate string matching method using said sequence tags, genomic sequences from a genomic database which are similar to one or more of the sequence tags.
 2. The method of claim 1, wherein the genomic database is an unannotated genomic database.
 3. The method of claim 1, further comprising determining an open reading frame for the input polypeptide sequence in the genomic database, and, optionally, determining intron/exon boundaries in the open reading frame.
 4. The method of any of claims 1, 2 or 3, further comprising providing annotation for the genomic database.
 5. The method of claim 1, wherein the input polypeptide sequence is provided from a system for protein sequencing by mass spectrometry.
 6. The method of claim 5, wherein the input polypeptide sequence is provided by a computer which has a data link from a mass spectrometer system for transmitting the input polypeptide sequence.
 7. The method of claim 1, wherein the approximate string matching method is selected from: a Shift-And method, a Karp-Rabin fingerprint method, or a Commentz-Walter method.
 8. The method of claim 1, wherein the approximate string matching method is a GREP method.
 9. The method of claim 8, wherein the approximate string matching method is an AGREP method.
 10. The method of any of claims 1, 7, 8 or 9, wherein the approximate string matching method tolerates a maximal number of errors.
 11. The method of claim 10, wherein the method tolerates gaps for intronic sequence of a size equal to at least the average length of intronic sequences in the genomic database.
 12. The method of claim 10, wherein the error ratio, α, is less than 3.0.
 13. The method of claim 10, wherein the error ratio, α, is less than 1.0.
 14. The method of claim 1, wherein multiple sequence tags are combined into a single array which is used as the input for the approximate string matching method.
 15. A method for identifying a coding sequence in an unannotated genomic database, comprising: (i) receiving an input polypeptide sequence; and (ii) identifying, by an approximate string matching method using said input polypeptide sequence, coding sequences from a genomic database which has been dynamically translated in at least 3 reading frames.
 16. A computer system for identifying coding sequences in genomic databases, comprising: (i) a sub-system for calculating and/or storing potential coding sequences for a polypeptide; (ii) one or more databases of genomic sequence; and (iii) an ID program for performing approximate string matching between nucleic acid sequences in a manner which accounts for differences between the two sequences due to an intronic sequence; wherein the system generates a set of sequence tags corresponding to possible coding sequences for an input polypeptide sequence, and identifies, from the database, any genomic sequences which are similar to one or more of the sequence tags, and indicates exon/intron boundaries, if any, in the genomic sequence(s).
 17. The computer system of claim 16, further including a sample/identification proteomics database for logging and correlating information.
 18. The computer system of claim 17, wherein said information is one or more of: sample identity, gel photos, mass spectra (and features therein), and search results.
 19. The computer system of claim 16, further including a sub-system to automate the transfer and utilization of mass spectrometric data of a target polypeptide.
 20. A mass spectrometry system including the computer system of any of claims 16-19, and a mass spectrometer for sequencing polypeptides.
 21. The mass spectrometry system of claim 20, wherein the spectrometer includes an ion source selected from: electrospray or MALDI.
 22. A method of conducting a proteomics business, comprising: (i) by the method of claim 1 or 15, determining the identity of a target gene encoding a protein isolated on the basis of the protein being (a) involved in an interaction of interest, (b) having a cellular localization of interest, (c) having a differential expression pattern of interest, or (d) being post-translationally modified; (ii) identifying agents by their ability to alter the level of expression of the target gene or the activity of an expression product of the target gene; (iii) conducting therapeutic profiling of agents identified in step (ii), or further analogs thereof, for efficacy and toxicity in animals; and (iv) formulating a pharmaceutical preparation including one or more agents identified in step (iii) as having an acceptable therapeutic profile.
 23. The method of claim 22, including an additional step of establishing a distribution system for distributing the pharmaceutical preparation for sale, and optionally including establishing a sales group for marketing the pharmaceutical preparation.
 24. A method of conducting a proteomics business, comprising: (i) by the method of claim 1 or 15, determining the identity of a target gene encoding a protein isolated on the basis of the protein being (a) involved in an interaction of interest, (b) having a cellular localization of interest, (c) having a differential expression pattern of interest, or (d) being post-translationally modified; (ii) (optionally) conducting therapeutic profiling of the target gene for efficacy and toxicity in animals; and (iii) licensing, to a third party, the rights for further drug development of inhibitors or activators of the target gene.
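
By way of non-limiting illustration only, the two search strategies recited in claims 1 and 15 may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the function names (reverse_translate, find_approximate, search_tags, six_frame_translate) are hypothetical, the standard genetic code is assumed, and a simple mismatch count over a sliding window stands in for the Shift-And, Karp-Rabin, Commentz-Walter or AGREP methods of claims 7-9. It does not model intronic gaps. The fragment used is taken from SEQ ID NO: 91 and encodes the beginning of the peptide of SEQ ID NO: 92.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_translate(peptide):
    """Return every DNA sequence tag that could encode the peptide (claim 1, step (i))."""
    codons = {}
    for codon, aa in CODON_TABLE.items():
        codons.setdefault(aa, []).append(codon)
    return {"".join(choice) for choice in product(*(codons[aa] for aa in peptide))}


def find_approximate(tag, genome, max_errors):
    """List (start, mismatches) for windows differing from the tag by at most max_errors.

    Assumption: substitutions only; a stand-in for a true approximate matcher.
    """
    hits = []
    for start in range(len(genome) - len(tag) + 1):
        window = genome[start:start + len(tag)]
        mismatches = sum(1 for x, y in zip(tag, window) if x != y)
        if mismatches <= max_errors:
            hits.append((start, mismatches))
    return hits


def search_tags(peptide, genome, max_errors=0):
    """Search the genome with every sequence tag of the peptide (claim 1, step (ii))."""
    results = []
    for tag in reverse_translate(peptide):
        for start, mismatches in find_approximate(tag, genome, max_errors):
            results.append((start, tag, mismatches))
    return sorted(results)


def six_frame_translate(genome):
    """Translate both strands in all three offsets (the alternative of claim 15)."""
    complement = str.maketrans("ACGT", "TGCA")
    strands = (genome, genome.translate(complement)[::-1])
    frames = []
    for strand in strands:
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            frames.append("".join(CODON_TABLE[c] for c in codons))
    return frames


# Fragment of SEQ ID NO: 91 (upper-cased) spanning the region encoding SEQ ID NO: 92.
FRAGMENT = "GGCTGTGCAGAAGTACCTCGAGGAGCTAGTCAAAATTGGT"
tag_hits = search_tags("KYLEE", FRAGMENT, max_errors=0)
frames = six_frame_translate(FRAGMENT)
```

In this sketch the number of sequence tags grows multiplicatively with the codon degeneracy of each residue, which is one reason a practical embodiment might combine the tags into a single array, as in claim 14, or might compare the peptide directly against translated reading frames, as in claim 15.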